Code
library(tidyverse)
library(alr4)
library(smss)
Question 1
For this question, the prediction equation is Price = -10,536 + 53.8 × HomeSize + 2.84 × LotSize.
A
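When HomeSize = 1240 and LotSize = 18,000, the predicted Price is:
Code
-10536 + (53.8 * 1240) + (2.84 * 18000)
[1] 107296
Since this home actually sold for $145,000, the residual (actual minus predicted) is:
Code
145000 - 107296
[1] 37704
The positive residual means the equation underpredicts this home's selling price by $37,704.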
B
When the lot size remains fixed, the price is predicted to increase by $53.80 for every one-square-foot increase in home size.
C
Given this same equation, if home size remains fixed, the lot size would need to increase by the amount below (about 18.94 square feet) to have the same impact on price as a one-square-foot increase in home size:
Code
53.8 / 2.84
[1] 18.94366
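As a quick check of the ratio above, the sketch below encodes the prediction equation in a helper (the name `predict_price` is hypothetical, introduced only for illustration) and confirms that one extra square foot of home size and 53.8 / 2.84 extra square feet of lot size change the predicted price by the same amount.

```r
# Coefficients from the prediction equation in Question 1
b_home <- 53.8   # dollars per square foot of home size
b_lot  <- 2.84   # dollars per square foot of lot size

# Hypothetical helper that evaluates the prediction equation
predict_price <- function(home_size, lot_size) {
  -10536 + b_home * home_size + b_lot * lot_size
}

base <- predict_price(1240, 18000)

# Gain from one more square foot of home size
gain_home <- predict_price(1241, 18000) - base

# Gain from b_home / b_lot more square feet of lot size
gain_lot <- predict_price(1240, 18000 + b_home / b_lot) - base

c(gain_home = gain_home, gain_lot = gain_lot)
```

Both gains should come out at about $53.80, which is the coefficient on HomeSize.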
Question 2
This question uses the “salary” data from the alr4 package to examine salary and characteristics of faculty in the early 1980s at a small Midwestern college.
Code
data("salary")
A
Code
fit_2a <- lm(salary ~ sex, data = salary)
summary(fit_2a)
Call:
lm(formula = salary ~ sex, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8602.8 -4296.6 -100.8 3513.1 16687.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24697 938 26.330 <2e-16 ***
sexFemale -3340 1808 -1.847 0.0706 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared: 0.0639, Adjusted R-squared: 0.04518
F-statistic: 3.413 on 1 and 50 DF, p-value: 0.0706
Based on this analysis, we cannot reject the null hypothesis of no difference in mean salary between male and female faculty at the 0.05 level (p = 0.0706). The point estimate suggests that female faculty members are paid an average of $3,340 less per year than male faculty members, but this difference is not statistically significant. The adjusted R-squared is also low: sex alone explains only about 4.52% of the variation in salary.
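The t value, p-value, and adjusted R-squared quoted above can be reproduced from the other numbers in the summary. The sketch below uses only values copied from the printed output (rounded, so the results match to about three decimals) and assumes n = 52, which follows from 50 residual degrees of freedom with two estimated coefficients.

```r
# Values copied from the sexFemale row of summary(fit_2a)
estimate <- -3340
std_error <- 1808
df_resid <- 50

# t statistic and two-sided p-value
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Adjusted R-squared from the multiple R-squared with one predictor
r2 <- 0.0639
n <- df_resid + 2
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

round(c(t = t_value, p = p_value, adj_r2 = adj_r2), 4)
```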
B
The model below adds degree, rank, year, and ysdeg as additional predictors.
Code
fit_2b <- lm(salary ~ sex + degree + rank + year + ysdeg, data = salary)
summary(fit_2b)
Call:
lm(formula = salary ~ sex + degree + rank + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
sexFemale 1166.37 925.57 1.260 0.214
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
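The 95% confidence interval for the difference in salary between males and females comes from confint(), called below. Using the sexFemale estimate and standard error reported above, the interval is approximately -$698 to $3,031, which includes zero.
Code
confint(fit_2b)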
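The interval for sexFemale can also be reconstructed by hand as estimate ± t × standard error. This is a sketch using the rounded values printed in the coefficient table, so it may differ from confint() in the last digit or two.

```r
# Values copied from the sexFemale row of summary(fit_2b)
estimate <- 1166.37
std_error <- 925.57
df_resid <- 45

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

ci <- estimate + c(-1, 1) * t_crit * std_error
round(ci, 1)
```

Because the interval spans zero, it is consistent with the non-significant p-value of 0.214 for sexFemale.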
C
This section interprets the findings for each predictor variable in the above model.
In this model, the coefficient for sex is not statistically significant (p = 0.214): once the other predictors are controlled for, there is no significant difference in salary between male and female faculty. Rank is significant: relative to assistant professors, associate professors earn about $5,292 more and full professors about $11,119 more, holding the other variables constant. Year is also significant: each additional year in the current rank is associated with a salary increase of about $476. Degree (p = 0.180) and ysdeg (p = 0.115) are not statistically significant, so neither holding a PhD rather than a Master’s nor the number of years since the highest degree was earned has a significant effect on salary once the other predictors are considered.
D
Code
salary$rank <- relevel(salary$rank, ref = "Prof")
fit_2d <- lm(salary ~ sex + degree + rank + year + ysdeg, data = salary)
summary(fit_2d)
Call:
lm(formula = salary ~ sex + degree + rank + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26864.81 1375.29 19.534 < 2e-16 ***
sexFemale 1166.37 925.57 1.260 0.214
degreePhD 1388.61 1018.75 1.363 0.180
rankAsst -11118.76 1351.77 -8.225 1.62e-10 ***
rankAssoc -5826.40 1012.93 -5.752 7.28e-07 ***
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
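Changing the reference level does not affect the model's fit (the residual standard error, R-squared, and F-statistic are identical) or the coefficients for sex, degree, year, and ysdeg. Only the intercept and the rank coefficients change, because they are now expressed relative to full professors. The coefficient for rankAsst is -11,118.76, meaning assistant professors are predicted to earn $11,118.76 less than the reference group of full professors, holding the other variables constant. This mirrors the previous model, where rankProf showed an increase of $11,118.76 relative to assistant professors.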
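The effect of relevel() can be illustrated on a small simulated data set. Everything in this sketch (the `toy` data frame, its `pay` column, and the chosen effect sizes) is hypothetical and is not the alr4 salary data; it only shows that changing the reference level leaves the fitted values unchanged and flips the sign of the corresponding contrast.

```r
# Hypothetical toy data, not the alr4 salary data
set.seed(1)
toy <- data.frame(
  rank = factor(rep(c("Asst", "Assoc", "Prof"), each = 10),
                levels = c("Asst", "Assoc", "Prof")),
  year = rep(1:10, times = 3)
)
toy$pay <- 15000 + 5000 * (toy$rank == "Assoc") +
  11000 * (toy$rank == "Prof") + 500 * toy$year + rnorm(30, sd = 1000)

# Fit with Asst as the reference level
fit_asst <- lm(pay ~ rank + year, data = toy)

# Refit with Prof as the reference level
toy$rank <- relevel(toy$rank, ref = "Prof")
fit_prof <- lm(pay ~ rank + year, data = toy)

# Same fitted values, so the fit itself is unchanged
same_fit <- all.equal(fitted(fit_asst), fitted(fit_prof))

# The Asst-vs-Prof contrast is the negative of the Prof-vs-Asst contrast
mirrored <- all.equal(unname(coef(fit_prof)["rankAsst"]),
                      -unname(coef(fit_asst)["rankProf"]))

c(same_fit = isTRUE(same_fit), mirrored = isTRUE(mirrored))
```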
E
This next model removes the variable rank from the model.
Code
fit_2e <- lm(salary ~ sex + degree + year + ysdeg, data = salary)
summary(fit_2e)
Call:
lm(formula = salary ~ sex + degree + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 2e-16 ***
sexFemale -1286.54 1313.09 -0.980 0.332209
degreePhD -3299.35 1302.52 -2.533 0.014704 *
year 351.97 142.48 2.470 0.017185 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
By removing the rank variable from the model, ysdeg and degree become statistically significant predictors. However, the adjusted R-squared decreases from 0.8357 to 0.5998, indicating that this model explains less of the variation in salary. The residual standard error also rises from 2398 to 3744, so this model fits the data worse than the models that include rank.
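One way to quantify how much rank contributes is a partial F-test comparing the models with and without it. The sketch below approximates that test from the rounded residual standard errors and degrees of freedom printed in the summaries; calling anova(fit_2e, fit_2b) on the fitted models would give the exact version.

```r
# Residual standard errors and residual df copied from the summaries
rse_full <- 2398      # model with rank (fit_2b / fit_2d)
df_full <- 45
rse_reduced <- 3744   # model without rank (fit_2e)
df_reduced <- 47

# Residual sums of squares implied by the printed values
rss_full <- rse_full^2 * df_full
rss_reduced <- rse_reduced^2 * df_reduced

# Partial F statistic for dropping the two rank coefficients
df_diff <- df_reduced - df_full
f_stat <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

A large F statistic with a very small p-value indicates that rank explains a substantial share of the variation that the reduced model leaves in the residuals.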
F
In this final model, a new variable called new_hire is created based on ysdeg. Faculty with a ysdeg of 15 or less (used here as a proxy for being hired within the last 15 years) are coded as 1, and the rest are coded as 0. The variables ysdeg, year, and rank are left out of the model to avoid multicollinearity, since years since degree, years in the current rank, and rank are all likely to be closely related to new_hire. In this model only new_hire is statistically significant at the 0.05 level (sex is marginal at p = 0.064 and degree is not significant), and the adjusted R-squared (0.4067) is lower than in the previous models, so this model does not fit the data better overall.
Code
salary$new_hire <- ifelse(salary$ysdeg <= 15, 1, 0)
fit_2f <- lm(salary ~ sex + degree + new_hire, data = salary)
summary(fit_2f)
Call:
lm(formula = salary ~ sex + degree + new_hire, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8260.4 -3557.7 -462.6 3563.2 12098.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28663 1155 24.821 < 2e-16 ***
sexFemale -2716 1433 -1.896 0.064 .
degreePhD -1227 1372 -0.895 0.375
new_hire -7418 1306 -5.679 7.74e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4558 on 48 degrees of freedom
Multiple R-squared: 0.4416, Adjusted R-squared: 0.4067
F-statistic: 12.65 on 3 and 48 DF, p-value: 3.231e-06
The results of this model suggest a statistically significant difference in salary between faculty members hired within the last 15 years and those hired earlier (p < 0.001). Specifically, those hired by the new dean earn about $7,418 less on average than those hired more than 15 years ago, holding sex and degree constant. The null hypothesis of no salary difference between the two groups is therefore rejected.
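To make the dummy coding concrete, the sketch below applies the same ifelse() rule to a few hypothetical ysdeg values (chosen to straddle the cutoff of 15) and then reads the two predicted salaries for the baseline group off the coefficients of summary(fit_2f).

```r
# Hypothetical ysdeg values, chosen to show the cutoff at 15
ysdeg <- c(3, 15, 16, 30)
new_hire <- ifelse(ysdeg <= 15, 1, 0)
new_hire

# Predicted salaries for the baseline group (male, Master's degree),
# using the intercept and new_hire coefficient from summary(fit_2f)
intercept <- 28663
b_new_hire <- -7418
predicted <- c(earlier_hire = intercept, new_hire = intercept + b_new_hire)
predicted
```

A ysdeg of exactly 15 is coded as 1 because the comparison uses `<=`.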
Question 3
Code
data("house.selling.price")
This question uses the dataset house.selling.price from the package smss.
A
Code
fit_3a <- lm(Price ~ Size + New, data = house.selling.price)
summary(fit_3a)
Call:
lm(formula = Price ~ Size + New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
Size 116.132 8.795 13.204 < 2e-16 ***
New 57736.283 18653.041 3.095 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
In the first model, we investigated the effect of house size and of whether the house is new on price, and found that both variables are statistically significant. Each additional square foot of size is associated with a $116.13 increase in price, holding New constant. The model also predicts that a new house costs $57,736.28 more than a house of the same size that is not new.
B
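The general prediction equation is Price = -40230.867 + 116.132 × Size + 57736.283 × New. For a new home (New = 1), this becomes Price = 17505.416 + 116.132 × Size; for a home that is not new (New = 0), it is Price = -40230.867 + 116.132 × Size.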
C
The predicted selling price for a new home of 3000 square feet is below.
Code
df_new <- data.frame(Size = 3000, New = 1)
predict(fit_3a, newdata = df_new)
1
365900.2
The predicted selling price for a home of 3000 square feet that is not new is below.
Code
df_not_new <- data.frame(Size = 3000, New = 0)
predict(fit_3a, newdata = df_not_new)
1
308163.9
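The two predictions above can be checked by plugging the values into the fitted equation by hand. This sketch uses the rounded coefficients from summary(fit_3a), so the results differ from predict() by a dollar or two; the helper name `price_hat` is hypothetical.

```r
# Coefficients copied from summary(fit_3a)
b0 <- -40230.867
b_size <- 116.132
b_new <- 57736.283

# Hypothetical helper that evaluates the additive model
price_hat <- function(size, new) b0 + b_size * size + b_new * new

# Predictions for a 3000 square foot home: new, then not new
by_hand <- price_hat(size = 3000, new = c(1, 0))
by_hand
```

The gap between the two values is exactly the New coefficient, since the additive model assumes the same size slope for both groups.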
D
The next model includes an interaction term between size and new.
Code
fit_3d <- lm(Price ~ Size * New, data = house.selling.price)
summary(fit_3d)
Call:
lm(formula = Price ~ Size * New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 2e-16 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
E
Code
ggplot(house.selling.price, aes(y = Price, x = Size, color = factor(New))) +
  geom_point() +
  stat_smooth(method = "lm", se = TRUE)
F
The predicted selling price, using the model with the interaction term, for a new home of 3000 square feet is below.
Code
predict(fit_3d, newdata = df_new)
1
398307.5
The predicted selling price, using the model with the interaction term, for a home of 3000 square feet that is not new is below.
Code
predict(fit_3d, newdata = df_not_new)
1
291087.4
G
The predicted selling price, using the model with the interaction term, for a new home of 1500 square feet is below.
Code
df_new <- data.frame(Size = 1500, New = 1)
predict(fit_3d, newdata = df_new)
1
148776.1
The predicted selling price, using the model with the interaction term, for a home of 1500 square feet that is not new is below.
Code
df_not_new <- data.frame(Size = 1500, New = 0)
predict(fit_3d, newdata = df_not_new)
1
134429.8
Comparing the predictions for parts F and G, the difference in selling price between a new and a not-new home increases as the size of the home increases: about $14,346 at 1500 square feet versus about $107,220 at 3000 square feet.
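In the interaction model, the predicted gap between a new and a not-new home of the same size is the New coefficient plus the Size:New coefficient times the size. The sketch below evaluates that gap using the rounded coefficients from summary(fit_3d); the helper `gap` and the `break_even` quantity are introduced here for illustration only.

```r
# Coefficients copied from summary(fit_3d)
b_new <- -78527.502
b_size_new <- 61.916

# Predicted price gap (new minus not new) at a given size
gap <- function(size) b_new + b_size_new * size
gap(c(1500, 3000))

# Size at which the predicted gap is zero
break_even <- -b_new / b_size_new
break_even
```

Below the break-even size the model predicts a lower price for new homes, which is worth keeping in mind when reading the negative New coefficient on its own.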
H
Based on the higher adjusted R-squared (0.7363 versus 0.7169) and the lower residual standard error (52000 versus 53880), I consider the model with the interaction term to be a better fit for the relationship between size and new, and their influence on the outcome variable price. This model indicates that the effect of size on price depends on whether the house is new. Specifically, each additional square foot is associated with a $166.35 increase in price for new houses (104.438 + 61.916), while for houses that are not new the increase is $104.44. The intercepts also differ: -$100,755.31 for new houses (-22,227.808 - 78,527.502) and -$22,227.81 for houses that are not new. Neither intercept has a practical interpretation, since a house of size 0 is outside the range of the data.
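The group-specific intercepts and slopes follow directly from the coefficient table of the interaction model. The sketch below derives them from the rounded coefficients in summary(fit_3d); the object names are chosen for illustration.

```r
# Coefficients copied from summary(fit_3d)
coefs <- c(intercept = -22227.808, size = 104.438,
           new = -78527.502, size_new = 61.916)

# For not-new homes the main effects apply as printed;
# for new homes the New and Size:New terms are added on
lines_by_group <- data.frame(
  group = c("not new", "new"),
  intercept = c(coefs[["intercept"]],
                coefs[["intercept"]] + coefs[["new"]]),
  slope = c(coefs[["size"]],
            coefs[["size"]] + coefs[["size_new"]])
)
lines_by_group
```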
Source: "Homework - 4", Thrishul, 04/20/2023