---
title: "Homework 4"
author: "Jerin Jacob"
description: "Homework 4 - 603 Spring 2023"
date: "05/11/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw4
- challenge4
- Jerin Jacob
---
```{r}
#| label: setup
#| warning: false
library(readxl)
library(dplyr)
library(magrittr)
library(alr4)
library(smss)
library(ggplot2)
library(stargazer)
knitr::opts_chunk$set(echo = TRUE)
```
Question 1:
A)
Finding the predicted house price. We write a function that returns y_hat from the fitted equation, where x1 is home size (in square feet) and x2 is lot size.
```{r}
# Fitted equation: y_hat = -10536 + 53.8*x1 + 2.84*x2
# x1 = home size (sq ft), x2 = lot size
y_hat <- function(x1, x2) {
  -10536 + 53.8 * x1 + 2.84 * x2
}
predicted_price <- y_hat(1240, 18000)
cat('The predicted selling price is $', predicted_price, sep = '')
```
```{r}
Actual_price <- 145000
cat('The residual is', Actual_price - predicted_price)
```
The residual is positive, which means the model underpredicted the selling price, and by a considerably large amount ($37,704).
B)
For a fixed lot size, the predicted selling price increases by $53.80 for every one-square-foot increase in home size, because 53.8 is the coefficient on home size in a model that also includes lot size.
C)
Holding lot size constant, a one-square-foot increase in home size raises the predicted price by $53.80. Holding home size constant, a one-unit increase in lot size raises it by $2.84. So, to match the price change from a one-square-foot increase in home size, lot size would have to increase by 53.8/2.84 ≈ 18.9 units.
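As a quick check of that arithmetic:
```{r}
# Lot-size increase equivalent to one extra square foot of home size
53.8 / 2.84
```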
Question 2:
Loading the data
```{r}
data("salary", package = "alr4")
head(salary)
```
To test the hypothesis that the mean salary for men and women is the same, without regard to any other variable, we can run a simple linear regression with sex as the only explanatory variable. This is equivalent to a pooled two-sample t-test of salary between the Male and Female groups.
```{r}
summary(lm(salary ~ sex, data = salary))
```
The p-value (0.0706) is less than .10, so we can reject the null hypothesis at the 10% significance level (though not at 5%): the means are not the same. The coefficient of sexFemale is -3340, which means that, ignoring all other variables, women's mean salary is $3,340 lower than men's.
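As a sanity check, the pooled t-test gives the same p-value and the same t statistic (up to sign); a minimal sketch, where var.equal = TRUE imposes the equal-variance assumption the regression makes implicitly:
```{r}
# t and p match the sexFemale row of the regression output
t.test(salary ~ sex, data = salary, var.equal = TRUE)
```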
B)
Running a multiple linear regression with salary as the outcome variable and all other variables as predictors, and reporting 95% confidence intervals for the coefficients.
```{r}
lm(salary ~ ., data = salary) |>
  confint()
```
The 95% confidence interval for sexFemale is (-698, 3031). Controlling for the other variables, female faculty earn somewhere between $698 less and $3,031 more than male faculty, with 95% confidence.
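The interval can also be computed by hand from the estimate and its standard error (a sketch of the computation confint() performs):
```{r}
fit <- lm(salary ~ ., data = salary)
est <- coef(summary(fit))["sexFemale", "Estimate"]
se  <- coef(summary(fit))["sexFemale", "Std. Error"]
# estimate +/- t(0.975, residual df) * standard error
est + c(-1, 1) * qt(0.975, df.residual(fit)) * se
```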
C)
```{r}
summary(lm(salary ~ ., data = salary))
```
Degree: This dummy variable is not statistically significant; its p-value (.180) is above any conventionally accepted significance level. The coefficient suggests that PhDs make $1,388.61 more than those with Master's degrees, controlling for everything else.
Rank: The model treats rank as a categorical variable, ignoring its order. An ordered categorical variable like rank is usually treated either as a regular categorical variable or as numeric; the numeric treatment is common when there are many levels or the distances between levels are reasonably equal. Here, with only 3 levels, treating it as a regular categorical variable is acceptable.
For rank, the base/reference level is Assistant Professor. rankAssoc suggests that Associate Professors make $5,292 more than the base level, and rankProf suggests that Professors make $11,118.76 more. Both dummy variables are statistically significant.
To test the statistical significance of rank as a whole, rather than the individual dummy variables, we can run a partial F-test comparing the model with all the variables to the one without the rank dummies:
```{r}
fit1 <- lm(salary ~ ., data = salary)
fit2 <- lm(salary ~ . - rank, data = salary)  # same model with the rank dummies dropped
anova(fit1, fit2)
```
The very low p-value suggests that the rank variable is significant as a whole.
Sex: The p-value (.214) suggests this variable is not significant at conventional levels. The coefficient suggests that female faculty make $1,166 more after controlling for all other variables, but interpreting a coefficient whose effect is insignificant is not meaningful.
Year: Statistically significant; the coefficient suggests that every additional year is associated with a $476 increase in salary.
ysdeg: Not significant. The coefficient suggests that every additional year since the highest degree is associated with $124 less salary.
D)
```{r}
salary$rank <- relevel(salary$rank, ref = 'Prof')  # make Prof the reference level
summary(lm(salary ~ ., data = salary))
```
Now the reference level for rank is Professor, so the coefficients for the other levels change accordingly. Assistant Professors make $11,118.76 less than Professors, controlling for other variables, and Associate Professors make $5,826 less (note that 5292.36 - 11118.76 = -5826.40, consistent with the earlier fit).
E)
```{r}
summary(lm(salary ~ . - rank, data = salary))
```
After removing the rank variable, the coefficients change substantially. The coefficient of degreePhD is now negative (-3299), suggesting that faculty with a PhD are paid $3,299 less than non-PhD faculty, controlling for the other variables. The signs for sex and ysdeg also flip; ysdeg is now statistically significant, while sex still is not.
F)
We need to create a dummy variable, newDean_hire, for faculty within 15 or fewer years of their highest degree. Since the new variable is derived from an existing one, there is a high chance of multicollinearity, so testing the correlation between the new variable and the original is a good idea.
```{r}
salary$newDean_hire <- ifelse(salary$ysdeg <= 15, 1, 0)  # hired during the new dean's tenure
cor.test(salary$newDean_hire, salary$ysdeg)
```
The two variables are strongly negatively correlated (r = -0.84). So, when we run the regression, we exclude ysdeg and include newDean_hire instead, along with the other control variables.
```{r}
summary(lm(salary ~ . - ysdeg, data = salary))
```
The newDean_hire coefficient is 2163, statistically significant at the 5% level (p = 0.0496). It suggests that the new dean has been giving more generous salaries to faculty who are within 15 years of their highest degree.
Just checking how the regression performs if we include both variables:
```{r}
summary(lm(salary ~ ., data = salary))
```
Now neither variable is significant, because of the multicollinearity between them.
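One way to quantify that multicollinearity is with variance inflation factors (a sketch, assuming the car package is available; alr4 loads it as a dependency):
```{r}
# Generalized VIFs for the model containing both ysdeg and newDean_hire
car::vif(lm(salary ~ ., data = salary))
```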
Question 3:
Loading the data
```{r}
data("house.selling.price", package = "smss")
head(house.selling.price)
```
A)
```{r}
y_hat <- lm(Price ~ Size + New, data = house.selling.price)
summary(y_hat)
```
Both Size and New are statistically significant. A one-square-foot increase in size is associated with a $116 increase in price, controlling for whether the house is new. New houses are priced $57,736 higher than old houses of the same size.
B)
The prediction equation is: E[Price] = -40230.867 + 116.132 * Size + 57736.283 * New
The New variable is binary: it takes the value 1 when the house is new and 0 when it is old.
The prediction equation for new houses is: E[Price] = -40230.867 + 116.132 * Size + 57736.283 * 1, i.e., E[Price] = 17505.42 + 116.132 * Size
The prediction equation for old houses is: E[Price] = -40230.867 + 116.132 * Size + 57736.283 * 0 = -40230.867 + 116.132 * Size
C)
```{r}
# One way of finding the predicted selling price
size <- 3000
new <- 1
not_new <- 0
house_price_new <- -40230.867 + 116.132 * size + 57736.283 * new
house_price_not_new <- -40230.867 + 116.132 * size + 57736.283 * not_new
# Print the results
cat("Price for new houses with 3000 sqft size is: [", house_price_new, "]\n")
cat("Price for houses with 3000 sqft size that are not new is: [", house_price_not_new, "]\n")
# Easier way: predict() uses the unrounded coefficients, so it differs slightly from the manual results
predict_data <- data.frame(Size = c(3000, 3000), New = c(1, 0))
predict(y_hat, predict_data)
```
D)
```{r}
y_hat_interaction <- lm(Price ~ Size * New, data = house.selling.price)  # expands to Size + New + Size:New
summary(y_hat_interaction)
```
After adding the interaction between Size and New, the coefficient on New reverses sign and is no longer statistically significant. The interaction term is significant at the 1% level (p = 0.00527).
E)
The prediction equation is: E[Price] = -22227.808 + 104.438 * Size - 78527.502 * New + 61.916 * Size * New
For a new house: E[Price] = -22227.808 + 104.438 * Size - 78527.502 * 1 + 61.916 * Size * 1
= -100755.3 + 166.354 * Size
For an old house: E[Price] = -22227.808 + 104.438 * Size - 78527.502 * 0 + 61.916 * Size * 0
= -22227.808 + 104.438 * Size
F)
```{r}
predict_data_interaction <- data.frame(Size = c(3000, 3000), New = c(1, 0))
prediction_with_interaction <- predict(y_hat_interaction, predict_data_interaction)
cat("Price for houses with 3000 sqft size that are new and not new are: [", prediction_with_interaction, "] consecutively \n")
```
A new home is $107,220 pricier than an old house of 3000 sqft.
G)
```{r}
predict_data_interaction2 <- data.frame(Size = c(1500, 1500), New = c(1, 0))
prediction_with_interaction2 <- predict(y_hat_interaction, predict_data_interaction2)
cat("Price for houses with 1500 sqft size that are new and not new are: [", prediction_with_interaction2, "] consecutively \n")
```
A new 1500 sqft home has a predicted price of $148,776; an old home of the same size has a predicted price of $134,430. The new home is $14,346 pricier.
Comparing the new-versus-old price differences in F and G, bigger houses show a much bigger price gap between new and old homes, which is exactly what the positive Size:New interaction implies; the sketch below makes this explicit.
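The new-home premium implied by the fitted coefficients is a linear function of size (a small sketch reading the coefficients off the interaction model above):
```{r}
# Premium for a new home at a given size: beta_New + beta_Size:New * size
new_premium <- function(size) {
  b <- coef(y_hat_interaction)
  unname(b["New"] + b["Size:New"] * size)
}
new_premium(c(1500, 3000))  # about $14,346 and $107,220
```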
H)
```{r}
summary(y_hat)
summary(y_hat_interaction)
```
The preferred model is the one with the interaction term: the interaction is statistically significant, and the adjusted R-squared is higher (0.7363 vs 0.7169). The higher adjusted R-squared means that, even after penalizing for the additional term, the interaction model fits better.
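Since stargazer is already loaded, a side-by-side table is a convenient way to present this comparison (a sketch; type = "text" prints a plain-text table):
```{r}
stargazer(y_hat, y_hat_interaction, type = "text",
          column.labels = c("No interaction", "Interaction"))
```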