hw4
descriptive statistics
probability
Homework 4
Author

Thrishul

Published

April 20, 2023

Load the necessary packages

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(alr4)
Warning: package 'alr4' was built under R version 4.2.3
Loading required package: car
Warning: package 'car' was built under R version 4.2.3
Loading required package: carData
Warning: package 'carData' was built under R version 4.2.3

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
Warning: package 'effects' was built under R version 4.2.3
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
Warning: package 'smss' was built under R version 4.2.3

Question 1

For this question the prediction equation is Price = -10,536 + 53.8HomeSize + 2.84LotSize

A

When HomeSize = 1240 and LotSize= 18,000, the predicted Price is:

Code
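-10536 + (53.8 * 1240) + (2.84 * 18000)
[1] 107296

Since this home actually sold for $145,000, the residual (actual price minus predicted price) is

Code
145000 - 107296
[1] 37704

The positive residual means the model underpredicts this home's selling price by $37,704.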

B

When lot size is held fixed, the predicted price increases by $53.80 for every one-square-foot increase in home size.

C

With home size held fixed, lot size would need to increase by the amount below (in square feet) to have the same effect on predicted price as a one-square-foot increase in home size:

Code
53.8 / 2.84
[1] 18.94366
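
The interpretations in parts B and C can be checked numerically. The sketch below wraps the prediction equation in a helper, predict_price (a hypothetical name introduced here for illustration, not part of the original homework), and compares a one-square-foot increase in home size with a 53.8/2.84-square-foot increase in lot size. The baseline house is arbitrary.

```r
# Hypothetical helper encoding the Question 1 prediction equation
predict_price <- function(home_size, lot_size) {
  -10536 + 53.8 * home_size + 2.84 * lot_size
}

# Arbitrary baseline house; the differences below do not depend on it
base <- predict_price(2000, 10000)

# Effect of one extra square foot of home size, lot size fixed
home_effect <- predict_price(2001, 10000) - base

# Effect of 53.8 / 2.84 extra square feet of lot size, home size fixed
lot_effect <- predict_price(2000, 10000 + 53.8 / 2.84) - base

c(home_effect = home_effect, lot_effect = lot_effect)
```

Both effects should come out at about 53.8, which is what makes the two changes equivalent in terms of predicted price.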

Question 2

This question uses the “salary” data from the alr4 package to examine the salaries and characteristics of faculty at a small Midwestern college in the early 1980s.

Code
data("salary")

A

Code
fit_2a <- lm(salary ~ sex, data = salary)

summary(fit_2a)

Call:
lm(formula = salary ~ sex, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8602.8 -4296.6  -100.8  3513.1 16687.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    24697        938  26.330   <2e-16 ***
sexFemale      -3340       1808  -1.847   0.0706 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared:  0.0639,    Adjusted R-squared:  0.04518 
F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706

Based on this analysis, we cannot reject the null hypothesis of no difference in mean salary between male and female faculty at the 0.05 level (p = 0.0706). The point estimate suggests that female faculty members are paid $3,340 less per year on average than male faculty members, but this difference is not statistically significant. The low adjusted R-squared indicates that sex alone explains only about 4.5% of the variation in salary.
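
A regression on a single two-level predictor is equivalent to a pooled-variance two-sample t-test, so the result above can be cross-checked that way. The sketch below is an added check rather than part of the original assignment; it assumes the alr4 package is installed and refits the model under a new name so that it runs on its own.

```r
library(alr4)  # provides the salary data used above

# Pooled-variance two-sample t-test of salary by sex
tt <- t.test(salary ~ sex, data = salary, var.equal = TRUE)

# Same model as fit_2a, refit here so the block is self-contained
fit <- lm(salary ~ sex, data = salary)

# The t statistics agree in magnitude and the p-values are identical
c(t_test = unname(abs(tt$statistic)),
  lm     = abs(coef(summary(fit))["sexFemale", "t value"]))
c(t_test = tt$p.value,
  lm     = coef(summary(fit))["sexFemale", "Pr(>|t|)"])
```

The sign of the t statistic can differ between the two approaches because t.test() subtracts the second group mean from the first, which is why the comparison uses absolute values.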

B

The below model adds in degree, rank, year, and ysdeg as additional predictors to the regression model.

Code
fit_2b <- lm(salary ~ sex + degree + rank + year + ysdeg, data = salary)

summary(fit_2b)

Call:
lm(formula = salary ~ sex + degree + rank + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
sexFemale    1166.37     925.57   1.260    0.214    
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

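The 95% confidence intervals for all coefficients are below. The sexFemale row gives the interval for the difference in salary between females and males, controlling for the other predictors: (-697.82, 3030.56). Because this interval contains zero, the difference is not statistically significant at the 0.05 level.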

Code
confint(fit_2b)
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
sexFemale    -697.8183  3030.56452
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
year          285.1433   667.47476
ysdeg        -280.6397    31.49105
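
As a sketch of where the sexFemale interval comes from, it can be rebuilt by hand as the estimate plus or minus a t critical value times the standard error. The three inputs below are copied from the summary(fit_2b) output; the variable names are introduced here for illustration.

```r
# Values copied from the summary(fit_2b) output for sexFemale
estimate  <- 1166.37
std_error <- 925.57
df_resid  <- 45

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = df_resid)

# Lower and upper limits of the interval
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```

The result should agree with the confint() row for sexFemale up to the rounding of the printed estimate and standard error.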

C

This section interprets the findings for each predictor variable in the above model.

In this model, sex is not statistically significant (p = 0.214): controlling for the other predictors, the estimated salary difference between female and male faculty ($1,166.37) cannot be distinguished from zero. Rank is significant: relative to assistant professors, associate professors earn an estimated $5,292.36 more and full professors $11,118.76 more. Year is also significant: each additional year in the current rank is associated with a $476.31 increase in salary. Degree (PhD versus Master’s) and ysdeg (years since the highest degree was earned) are not statistically significant once the other predictors are in the model.

D

Code
salary$rank <- relevel(salary$rank, ref = "Prof")

fit_2d <- lm(salary ~ sex + degree + rank + year + ysdeg, data = salary)

summary(fit_2d)

Call:
lm(formula = salary ~ sex + degree + rank + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
sexFemale     1166.37     925.57   1.260    0.214    
degreePhD     1388.61    1018.75   1.363    0.180    
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Changing the reference level of rank to Prof does not change the fit of the model (the residuals, R-squared, and F-statistic are identical) or the coefficients of the other predictors. Only the intercept and the rank coefficients change, because they are now measured relative to full professors. The coefficient for rankAsst is -11,118.76, meaning assistant professors are predicted to earn $11,118.76 less than full professors, which mirrors the +11,118.76 for rankProf in the previous model.
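
The relationship between the two sets of rank coefficients is plain arithmetic. A minimal sketch using the printed estimates from the model in part B (variable names are introduced here for illustration):

```r
# Coefficients from the model with Asst as the reference level (part B)
b_intercept <- 15746.05
b_assoc     <- 5292.36
b_prof      <- 11118.76

# With Prof as the reference, every rank effect is shifted by -b_prof
new_intercept <- b_intercept + b_prof   # baseline is now a full professor
new_asst      <- 0 - b_prof             # Asst relative to Prof
new_assoc     <- b_assoc - b_prof       # Assoc relative to Prof

round(c(intercept = new_intercept, rankAsst = new_asst, rankAssoc = new_assoc), 2)
```

These values reproduce the intercept, rankAsst, and rankAssoc estimates in the releveled summary, which is why the fitted values and overall fit are unchanged.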

E

This next model removes the variable rank from the model.

Code
fit_2e <- lm(salary ~ sex + degree + year + ysdeg, data = salary)

summary(fit_2e)

Call:
lm(formula = salary ~ sex + degree + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
sexFemale   -1286.54    1313.09  -0.980 0.332209    
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

By removing the rank variable from the model, ysdeg and degree become statistically significant predictors, and the signs of the sexFemale and degreePhD coefficients flip to negative. However, the adjusted R-squared drops from 0.8357 to 0.5998 and the residual standard error rises from 2398 to 3744, so this model explains considerably less of the variation in salary than the models that include rank.
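
One way to quantify the loss of fit is a partial F-test for the two rank coefficients. This test is an addition here, not part of the original write-up, and the sketch below only approximates it because it works from the rounded residual standard errors printed in the two summaries. With the fitted objects available, anova(fit_2e, fit_2b) would give the exact version.

```r
# Residual standard errors and residual df from the two summaries
rse_full <- 2398; df_full <- 45   # with rank (part B)
rse_red  <- 3744; df_red  <- 47   # without rank (part E)

# Residual sum of squares = RSE^2 * residual df
rss_full <- rse_full^2 * df_full
rss_red  <- rse_red^2  * df_red

# Partial F statistic for the two rank coefficients
f_stat <- ((rss_red - rss_full) / (df_red - df_full)) / (rss_full / df_full)
p_val  <- pf(f_stat, df_red - df_full, df_full, lower.tail = FALSE)

c(F = f_stat, p = p_val)
```

A large F statistic with a very small p-value would indicate that rank contributes substantially to the model beyond the other predictors.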

F

In this final model, a new variable called new_hire is created based on ysdeg, which serves as a proxy for time since hire. Faculty whose ysdeg is 15 years or less are coded as 1, and all others are coded as 0. The variable year is left out to avoid multicollinearity, since years since hire and years in the current rank may be closely related; ysdeg and rank are also omitted, as the model formula below shows. The adjusted R-squared (0.4067) is lower than in the models from parts B and E, and new_hire is the only predictor that is significant at the 0.05 level.

Code
salary$new_hire <- ifelse(salary$ysdeg <= 15, 1, 0)

fit_2f <- lm(salary ~ sex + degree + new_hire, data = salary)

summary(fit_2f)

Call:
lm(formula = salary ~ sex + degree + new_hire, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8260.4 -3557.7  -462.6  3563.2 12098.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    28663       1155  24.821  < 2e-16 ***
sexFemale      -2716       1433  -1.896    0.064 .  
degreePhD      -1227       1372  -0.895    0.375    
new_hire       -7418       1306  -5.679 7.74e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4558 on 48 degrees of freedom
Multiple R-squared:  0.4416,    Adjusted R-squared:  0.4067 
F-statistic: 12.65 on 3 and 48 DF,  p-value: 3.231e-06

The results of this model suggest that there is a statistically significant difference in salary between faculty members who were hired within the last 15 years and those who were hired more than 15 years ago (p = 7.74e-07). Specifically, controlling for sex and degree, those who were hired by the new dean earn an estimated $7,418 less than those hired earlier. The null hypothesis of no difference is therefore rejected.
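
The multicollinearity concern raised for part F can be inspected directly. The sketch below is an added check, assuming the alr4 package is installed; it computes the correlation between year and ysdeg and cross-tabulates a locally rebuilt new_hire indicator against rank. No expected values are shown because they depend on the data.

```r
library(alr4)  # provides the salary data

# Correlation between years in current rank and years since highest degree
r_year_ysdeg <- cor(salary$year, salary$ysdeg)
r_year_ysdeg

# Rebuild the indicator locally and compare it with rank
new_hire <- ifelse(salary$ysdeg <= 15, 1, 0)
table(new_hire, salary$rank)
```

A correlation close to 1, or a table in which recent hires are concentrated in the lower ranks, would support leaving the related variables out of the model together.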

Question 3

Code
data("house.selling.price")

This question uses the dataset house.selling.price from the package smss.

A

Code
fit_3a <- lm(Price ~ Size + New, data = house.selling.price)

summary(fit_3a)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

In the first model, we investigate the effect of house size and of whether the house is new on price, and find that both variables are statistically significant. A one-square-foot increase in size is associated with a $116.13 increase in predicted price, holding New constant. Additionally, a new house is predicted to sell for $57,736.28 more than a house of the same size that is not new.

B

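The prediction equation is Price = -40230.867 + 116.132Size + 57736.283New. For a new home (New = 1) this becomes Price = 17505.416 + 116.132Size, and for a home that is not new (New = 0) it becomes Price = -40230.867 + 116.132Size.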

C

The predicted selling price for a new home of 3000 square feet is below.

Code
df_new <- data.frame(Size = 3000, New = 1)

predict(fit_3a, newdata = df_new)
       1 
365900.2 

The predicted selling price for a home of 3000 square feet that is not new is below.

Code
df_not_new <- data.frame(Size = 3000, New = 0)

predict(fit_3a, newdata = df_not_new)
       1 
308163.9 
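
The two predictions can be reproduced by hand from the coefficient table. This is a sketch using the rounded estimates printed in summary(fit_3a), so the results differ from predict() by about a dollar; price_additive is a hypothetical helper name introduced here.

```r
# Coefficients copied from summary(fit_3a)
b0 <- -40230.867; b_size <- 116.132; b_new <- 57736.283

# Hypothetical helper: predicted price from the additive model
price_additive <- function(size, new) b0 + b_size * size + b_new * new

price_additive(3000, new = 1)  # close to predict()'s 365900.2
price_additive(3000, new = 0)  # close to predict()'s 308163.9

# The new vs. not-new gap equals the New coefficient at any size
price_additive(3000, 1) - price_additive(3000, 0)
price_additive(1500, 1) - price_additive(1500, 0)
```

Because the model has no interaction term, the gap between new and not-new homes is the same at every size.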

D

The next model includes an interaction term between size and new.

Code
fit_3d <- lm(Price ~ Size * New, data = house.selling.price)

summary(fit_3d)

Call:
lm(formula = Price ~ Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

E

Code
ggplot(house.selling.price,aes(y=Price,x=Size,color=factor(New)))+
  geom_point()+
  stat_smooth(method="lm",se=TRUE)
`geom_smooth()` using formula = 'y ~ x'

F

The predicted selling price, using the model with the interaction term, for a new home of 3000 square feet is below.

Code
predict(fit_3d, newdata = df_new)
       1 
398307.5 

The predicted selling price, using the model with the interaction term, for a home of 3000 square feet that is not new is below.

Code
predict(fit_3d, newdata = df_not_new)
       1 
291087.4 

G

The predicted selling price, using the model with the interaction term, for a new home of 1500 square feet is below.

Code
df_new <- data.frame(Size = 1500, New = 1)

predict(fit_3d, newdata = df_new)
       1 
148776.1 

The predicted selling price, using the model with the interaction term, for a home of 1500 square feet that is not new is below.

Code
df_not_new <- data.frame(Size = 1500, New = 0)

predict(fit_3d, newdata = df_not_new)
       1 
134429.8 

Comparing the predictions for parts F and G shows that the difference in selling price between a new and a not-new home increases as the size of the home increases: the gap is about $14,346 at 1500 square feet and about $107,220 at 3000 square feet.
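
In the interaction model the predicted gap between a new and a not-new home is a linear function of size, which can be sketched from the printed coefficients of summary(fit_3d). The helper name price_gap is hypothetical, and the results match the differences of the predict() outputs only up to coefficient rounding.

```r
# Coefficients copied from summary(fit_3d)
b_new <- -78527.502   # New main effect
b_int <- 61.916       # Size:New interaction

# Hypothetical helper: predicted price gap (new minus not new) at a given size
price_gap <- function(size) b_new + b_int * size

price_gap(c(1500, 3000))

# Size at which the predicted gap is zero
-b_new / b_int
```

The break-even size is roughly 1,268 square feet; below that the model predicts a lower price for new homes, which is an extrapolation that should be read with caution.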

H

Based on the higher adjusted R-squared (0.7363 versus 0.7169) and lower residual standard error (52000 versus 53880), I consider the model with the interaction term to be the better representation of how size and new jointly relate to price. This model indicates that the effect of size on price depends on whether the house is new. Specifically, each additional square foot is associated with a $166.35 increase in predicted price for new houses (104.438 + 61.916), compared with $104.44 for houses that are not new. The intercept for new houses is lower, at -22,227.81 - 78,527.50 = -100,755.31 versus -22,227.81, but neither intercept has a practical interpretation because no house has a size of 0.
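
The comparison can also be made formally with a nested-model F-test. This is an added check, assuming the smss package is installed; the models are refit under new names so the block runs on its own.

```r
library(smss)  # provides house.selling.price
data("house.selling.price")

fit_additive    <- lm(Price ~ Size + New, data = house.selling.price)
fit_interaction <- lm(Price ~ Size * New, data = house.selling.price)

# F-test of the single extra term (Size:New)
cmp <- anova(fit_additive, fit_interaction)
cmp

# Adjusted R-squared for the two models side by side
c(additive    = summary(fit_additive)$adj.r.squared,
  interaction = summary(fit_interaction)$adj.r.squared)
```

Because the models differ by one coefficient, the F-test p-value equals the p-value of the Size:New t-test in the interaction model's summary (0.00527).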