HW 4

Author

Karen Detter

Published

August 2, 2022

Code

library(tidyverse)
library(alr4)
library(smss)

knitr::opts_chunk$set(echo = TRUE)

Question 1

A.

$\hat{y}$ (predicted selling price) =

Code

yhat <- (-10536 + (53.8*1240) + (2.84*18000))
yhat

[1] 107296

residual = observed - predicted

Code

res <- 145000 - 107296
res

[1] 37704

The home in question sold for $37,704 more than the equation predicted, so the model underpredicts this sale. A positive residual of this size suggests that variables not included in the equation may affect selling price, although a single residual cannot establish that.

B.

For a fixed lot size, the predicted selling price increases by 53.8 dollars for each additional square foot of home size. Holding lot size fixed does not remove it from the equation; it makes the intercept and the lot-size term, -10536 + 2.84$x_{2}$, a constant that does not change with $x_{1}$. Each one-unit increase in $x_{1}$ therefore raises $\hat{y}$ by the coefficient 53.8.
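As a quick check of this interpretation, the sketch below evaluates the prediction equation at two home sizes one square foot apart while holding lot size fixed. The helper name `predict_price` and the second lot size (25000) are illustrative assumptions, not part of the original problem.

```r
# Prediction equation from part A; predict_price is a hypothetical helper name
predict_price <- function(home_size, lot_size) {
  -10536 + 53.8 * home_size + 2.84 * lot_size
}

# Hold lot size fixed and raise home size by one square foot
diff_at_18000 <- predict_price(1241, 18000) - predict_price(1240, 18000)
diff_at_25000 <- predict_price(1241, 25000) - predict_price(1240, 25000)

# Both differences match the home-size coefficient (up to floating-point error)
print(c(diff_at_18000, diff_at_25000))
```

Whatever lot size is held fixed, the intercept and the lot-size term cancel in the difference, leaving only the home-size coefficient.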

C.

Code

incr <- 53.8 / 2.84
incr

[1] 18.94366

Lot size ($x_{2}$) would need to increase by about 18.94 units to have the same effect on predicted price as a one-unit increase in home size ($x_{1}$).
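The ratio can be read as the result of equating the two effects on predicted price. Here $\Delta x_{1}$ and $\Delta x_{2}$ are notation introduced for illustration, denoting changes in home size and lot size:

$$
53.8\,\Delta x_{1} = 2.84\,\Delta x_{2}
\quad\Longrightarrow\quad
\Delta x_{2} = \frac{53.8}{2.84}\,\Delta x_{1} \approx 18.94\,\Delta x_{1}
$$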

Question 2

A.

Code

#test hypothesis with two sample t-test

data("salary")
t.test(salary ~ sex, data = salary)


    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14

Although the sample means differ between men and women, the null hypothesis of no difference in mean salary between the two groups cannot be rejected on the basis of this test alone, because the p-value of .09 exceeds the threshold of .05.

B.

Code

#run a multiple linear regression with all variables explaining salary

model <- lm(salary ~ ., data = salary)
summary(model)


Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Code

#obtain a 95% confidence interval for difference in salary by sex

confint(model, 'sexFemale')

              2.5 %   97.5 %
sexFemale -697.8183 3030.565

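Because the confidence interval includes 0, the null hypothesis of no difference in salary by sex, controlling for the other variables, cannot be rejected. This agrees with the result of the t-test in part A.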
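The interval reported by `confint()` can be reproduced by hand from the coefficient table. The sketch below uses only the rounded estimate, standard error, and residual degrees of freedom printed in the summary, so it matches the reported interval only up to rounding; the variable names are illustrative.

```r
# Estimate and standard error for sexFemale, copied from the summary output
estimate  <- 1166.37
std_error <- 925.57
df_resid  <- 45  # residual degrees of freedom reported by the model

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)

print(round(ci, 1))
```

The lower bound is negative and the upper bound is positive, so the interval spans zero.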

C.

In this model, the intercept of 15746 is the expected salary for the reference group: a male Assistant Professor with a Masters degree, zero years in current rank, and zero years since highest degree. It is not a base salary for all observations. Rank and years in current rank are statistically significant predictors of salary.

Each interpretation below holds the other variables in the model constant.

Holding a PhD rather than a Masters is associated with a salary increase of 1389, but the association is not statistically significant (p = .18).

Moving from rankAsst to rankAssoc corresponds to an increase in salary of 5292, and moving from rankAsst to rankProf corresponds to an increase of 11119.

Each additional year in current rank corresponds to a salary increase of 476, while each additional year since highest degree is associated with a decrease in salary of 125, although the latter association is not statistically significant.

Being female is associated with an increase in salary of 1166, but the relationship is not statistically significant.

D.

Code

#change baseline for rank category
salary$rank <- relevel(salary$rank, ref = 'Prof')

summary(lm(salary ~ ., data = salary))


Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
degreePhD     1388.61    1018.75   1.363    0.180    
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
sexFemale     1166.37     925.57   1.260    0.214    
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Holding the other variables constant, faculty at the rank of Associate have an expected salary 5826 lower than those at the rank of Professor, and faculty at the rank of Assistant have an expected salary 11119 lower than Professors.
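Changing the baseline does not change the fitted model, only how the rank effects are expressed. As a sketch, the releveled coefficients can be recovered by arithmetic from the original fit (Asst as baseline); the variable names below are illustrative, and the inputs are the rounded values from the first summary output.

```r
# Coefficients from the original fit, where Asst is the baseline rank
intercept_asst <- 15746.05
rank_assoc     <- 5292.36
rank_prof      <- 11118.76

# Re-express each quantity relative to Prof
intercept_prof <- intercept_asst + rank_prof  # intercept with Prof as baseline
asst_vs_prof   <- -rank_prof                  # Asst relative to Prof
assoc_vs_prof  <- rank_assoc - rank_prof      # Assoc relative to Prof

print(c(intercept_prof, asst_vs_prof, assoc_vs_prof))
```

These values agree with the intercept, rankAsst, and rankAssoc estimates in the releveled output, and the residuals and $R^{2}$ are identical between the two fits.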

E.

Code

#refit model excluding the rank variable
model_alt <- lm(salary ~ degree + sex + year + ysdeg, data = salary)
summary(model_alt)


Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

In this model, being female is associated with a salary decrease of 1287, but the effect is again not statistically significant (p = .33).

F.

Code

#create new variable for hiring dean
#(ysdeg > 15 is coded as hired by the previous dean, otherwise by the new dean)
hiring_dean <- salary %>%
  mutate(dean = case_when(ysdeg > 15 ~ 'prev',
                          ysdeg <= 15 ~ 'new'))

#fit new model to test hypothesis while avoiding multicollinearity
dean_model <- lm(salary ~ . - ysdeg, data = hiring_dean)
summary(dean_model)


Call:
lm(formula = salary ~ . - ysdeg, data = hiring_dean)

Residuals:
    Min      1Q  Median      3Q     Max 
-3403.3 -1387.0  -167.0   528.2  9233.8 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26588.79    1168.06  22.763  < 2e-16 ***
degreePhD      818.93     797.48   1.027   0.3100    
rankAsst    -11096.95    1191.00  -9.317 4.54e-12 ***
rankAssoc    -6124.28    1028.58  -5.954 3.65e-07 ***
sexFemale      907.14     840.54   1.079   0.2862    
year           434.85      78.89   5.512 1.65e-06 ***
deanprev     -2163.46    1072.04  -2.018   0.0496 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared:  0.8594,    Adjusted R-squared:  0.8407 
F-statistic: 45.86 on 6 and 45 DF,  p-value: < 2.2e-16

Because the variable dean is derived directly from ysdeg, the two would be highly correlated, so ysdeg was omitted from the model examining the effect of dean on salary.

The resulting model shows an effect of hiring dean on salary that is statistically significant at the .05 level, though only barely (p = .0496). Holding the other variables constant, being hired by the previous dean is associated with a salary 2163 lower than being hired by the new dean.
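Because the p-value for deanprev sits so close to the cutoff, it is worth seeing where it comes from. The sketch below recomputes the t statistic and two-sided p-value from the rounded estimate and standard error in the output above; the variable names are illustrative.

```r
# deanprev estimate and standard error, copied from the summary output
estimate  <- -2163.46
std_error <- 1072.04
df_resid  <- 45  # residual degrees of freedom reported by the model

# t statistic and two-sided p-value from the t distribution
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(c(t = t_value, p = p_value))
```

A result this close to .05 is sensitive to modeling choices, such as which other predictors are included, so it is best read as suggestive rather than conclusive.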

Question 3

A.

Code

data("house.selling.price")
summary(lm(Price ~ Size + New, data = house.selling.price))


Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

Size and whether a house is new each have a statistically significant effect on house price, both with p-values well below the significance threshold of .05.

Holding the other variable constant, each additional square foot of size is associated with an increase in price of 116, and new houses are associated with a price 57736 higher than houses that are not new.

B.

prediction equation:

$\hat{y}$ = -40231 + 116$x_{1}$ + 57736$x_{2}$

where y = house selling price, $x_{1}$ = house size (in sq ft), $x_{2}$ = 1 if the house is new and 0 if it is not

prediction equation for new houses ($x_{2}$ = 1):

$\hat{y}$ = -40231 + 116$x_{1}$ + 57736 = 17505 + 116$x_{1}$

prediction equation for not new houses ($x_{2}$ = 0):

$\hat{y}$ = -40231 + 116$x_{1}$

C.

(i)

3000 sq ft, new house:

$\hat{y}$ = -40231 + 116(3000) + 57736 = -40231 + 348000 + 57736 = -40231 + 405736 = $365,505

(ii)

3000 sq ft, not new house:

$\hat{y}$ = -40231 + 116(3000) + 0 = -40231 + 348000 = $307,769
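The two hand calculations can be checked with a short sketch. It uses the rounded coefficients from the prediction equation rather than the fitted model object, so results differ slightly from what `predict()` would return; `predict_additive` is a hypothetical helper name.

```r
# Additive model with rounded coefficients; new is 1 for a new house, 0 otherwise
predict_additive <- function(size, new) {
  -40231 + 116 * size + 57736 * new
}

new_3000 <- predict_additive(3000, new = 1)
old_3000 <- predict_additive(3000, new = 0)

print(c(new = new_3000, not_new = old_3000, gap = new_3000 - old_3000))
```

In this model the gap between new and not new houses is the coefficient on New, 57736, regardless of size.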

D.

Code

#fit model with an interaction term between variables

summary(lm(Price ~ Size + New + Size*New, data = house.selling.price))


Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

E.

The prediction equations generated from this model are:

(i) New

Price = -22228 + 104(Size) - 78528 + 62(Size) = -100756 + 166(Size)

(ii) Not New

Price = -22228 + 104(Size)

Therefore, each additional unit of size is associated with a price increase of 166 for new houses, which is 62 more than the increase of 104 for houses that are not new.

F.

Predicted Prices:

(i) 3000 sq ft New House

Price = -22228 + 104(3000) - 78528 + 62(3000) = $397,244

(ii) 3000 sq ft Not New House

Price = -22228 + 104(3000) = $289,772

G.

Predicted Prices:

(i) 1500 sq ft New House

Price = -22228 + 104(1500) - 78528 + 62(1500) = $148,244

(ii) 1500 sq ft Not New House

Price = -22228 + 104(1500) = $133,772

The ratio of the predicted selling prices of the new and not new 1500 sq ft houses is 1.108. The same ratio for the 3000 sq ft houses is 1.371.

As size increases, the predicted price difference between new and not new houses also increases, from $14,472 at 1500 sq ft to $107,472 at 3000 sq ft, which reflects the interaction between Size and New.
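The four predictions and the two ratios can be reproduced together. This sketch uses the rounded coefficients from part E rather than the fitted model object; `predict_interaction` is a hypothetical helper name.

```r
# Interaction model with rounded coefficients; new is 1 for a new house, 0 otherwise
predict_interaction <- function(size, new) {
  -22228 + 104 * size - 78528 * new + 62 * size * new
}

sizes     <- c(1500, 3000)
new_price <- predict_interaction(sizes, new = 1)
old_price <- predict_interaction(sizes, new = 0)

# Compare new and not new houses at each size
comparison <- data.frame(
  size       = sizes,
  new        = new_price,
  not_new    = old_price,
  difference = new_price - old_price,
  ratio      = round(new_price / old_price, 3)
)
print(comparison)
```

Unlike the additive model, the difference column changes with size, because the coefficient on Size:New adds 62 per square foot to the price of new houses only.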

H.

The model with the interaction term between Size and New appears to represent the relationship between these variables and Price better. The interaction coefficient is statistically significant (p = .005), and the Adjusted $R^{2}$ is slightly higher (0.7363 versus 0.7169), indicating that this model explains a somewhat larger share of the variance than the original model, even after the penalty for the extra term.
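Adjusted $R^{2}$ penalizes each added predictor, which is why it is the fairer comparison here. As a sketch, both adjusted values can be recovered from the reported Multiple $R^{2}$ using the standard formula; the sample size of 100 is inferred from the residual degrees of freedom (97 plus 3 estimated coefficients), and `adjusted_r2` is a hypothetical helper name.

```r
# Standard adjustment: n observations, p predictors (excluding the intercept)
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_obs <- 100  # inferred: 97 residual df + 3 coefficients in the additive model

adj_additive    <- adjusted_r2(0.7226, n_obs, p = 2)  # Size + New
adj_interaction <- adjusted_r2(0.7443, n_obs, p = 3)  # Size + New + Size:New

print(round(c(additive = adj_additive, interaction = adj_interaction), 4))
```

The interaction model keeps its advantage after the adjustment, so the improvement in fit is not simply a by-product of adding a term.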