Homework 4 Solution

Author

Dane Shelton

Published

November 14, 2022

Homework 4

(a)

(pred_1 <- -10536 + 53.8*(1240) + 2.84*(18000))
## [1] 107296

(resid_1 <- 145000 - pred_1)
## [1] 37704

The model predicts a selling price of $107,296 for a 1,240-sqft house built on an 18,000-sqft lot.
The residual (Actual - Predicted) for this prediction is $37,704. The large positive value indicates that the model under-predicts the selling price of this house.
The large residual could be due to a competitive market (cash offers, bidding, etc.), location, or amenities and renovations within the house, none of which the model accounts for.

(b)

With lot size held fixed, the model predicts selling price to increase by $53.80 for each 1-sqft increase in house size.

(c)

$53.8/2.84 = 18.94$. For a fixed home size, lot size would need to increase by 18.94 sqft to have the same effect on predicted price as a 1-sqft increase in home size.
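The ratio in (c) can be checked numerically. This is a minimal sketch that uses only the two slope coefficients from the prediction equation; the variable names are chosen here for illustration.

```r
# Slope coefficients from the prediction equation used in (a)
b_home <- 53.8   # dollars per sqft of house size
b_lot  <- 2.84   # dollars per sqft of lot size

# Lot-size increase with the same predicted effect as 1 sqft of house size
lot_equiv <- b_home / b_lot
round(lot_equiv, 2)

# Check: that many sqft of lot adds the same amount as 1 sqft of house
all.equal(b_lot * lot_equiv, b_home)
```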

(a)

salary <- alr4::salary

(by_sex <- t.test(formula=`salary`~`sex`, data=salary))
## 
##  Welch Two Sample t-test
## 
## data:  salary by sex
## t = 1.7744, df = 21.591, p-value = 0.09009
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -567.8539 7247.1471
## sample estimates:
##   mean in group Male mean in group Female 
##             24696.79             21357.14

At alpha=0.05 we do not have sufficient evidence to reject the null hypothesis, H0: there is no difference between Male and Female mean salaries (p-value 0.090). We cannot conclude that a difference exists between male and female mean salaries.
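The confidence interval and p-value reported by t.test() can be roughly reconstructed from the printed summary values alone. The sketch below assumes only the rounded numbers shown in the output, so the results agree with the output only up to rounding error.

```r
# Values printed by t.test() in (a)
mean_male   <- 24696.79
mean_female <- 21357.14
t_stat      <- 1.7744
df_welch    <- 21.591

diff_means <- mean_male - mean_female        # estimated difference in means
se_diff    <- diff_means / t_stat            # standard error implied by t
margin     <- qt(0.975, df_welch) * se_diff  # half-width of the 95% interval

ci      <- c(lower = diff_means - margin, upper = diff_means + margin)
p_value <- 2 * pt(-abs(t_stat), df_welch)    # two-sided p-value

round(ci, 1)
round(p_value, 3)
```

Because the interval includes 0 and the p-value exceeds 0.05, both lead to the same decision.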

(b)

salary_1 <- lm(salary ~.,
               data = salary)
summary(salary_1)
## 
## Call:
## lm(formula = salary ~ ., data = salary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4045.2 -1094.7  -361.5   813.2  9193.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15746.05     800.18  19.678  < 2e-16 ***
## degreePhD    1388.61    1018.75   1.363    0.180    
## rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
## rankProf    11118.76    1351.77   8.225 1.62e-10 ***
## sexFemale    1166.37     925.57   1.260    0.214    
## year          476.31      94.91   5.018 8.65e-06 ***
## ysdeg        -124.57      77.49  -1.608    0.115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2398 on 45 degrees of freedom
## Multiple R-squared:  0.855,  Adjusted R-squared:  0.8357 
## F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

print('95% Confidence Interval:')
## [1] "95% Confidence Interval:"

(confint.lm(salary_1, level = 0.95))
##                  2.5 %      97.5 %
## (Intercept) 14134.4059 17357.68946
## degreePhD    -663.2482  3440.47485
## rankAssoc    2985.4107  7599.31080
## rankProf     8396.1546 13841.37340
## sexFemale    -697.8183  3030.56452
## year          285.1433   667.47476
## ysdeg        -280.6397    31.49105

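The 95% confidence interval for sexFemale, (-697.82, 3030.56), contains 0, so after adjusting for the other variables we again cannot conclude that salary differs by sex. This is consistent with the t-test in (a), although that test compared unadjusted group means.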
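The interval for sexFemale follows the usual form estimate ± t-quantile × standard error. A minimal sketch that takes the estimate, standard error, and residual degrees of freedom from the summary output in (b):

```r
# sexFemale row of summary(salary_1) and the residual degrees of freedom
est      <- 1166.37
se       <- 925.57
df_resid <- 45

# 95% confidence interval: estimate +/- t quantile * standard error
ci_sex <- est + c(-1, 1) * qt(0.975, df_resid) * se
round(ci_sex, 2)

# The interval contains 0 when its endpoints have opposite signs
prod(sign(ci_sex)) < 0
```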

(c)

degreePhD: 1388.61, p-value 0.180

degree is a categorical variable with 2 levels: Masters and PhD. Our coefficient of 1388.61 suggests that, with all other variables held constant, the model estimates the difference in mean salary between PhD recipients and Masters recipients at the university to be $1,388.61, with faculty holding a PhD earning more.
The p-value of 0.180 indicates that the variable is not significant at the alpha=0.05 level, so we cannot reject H0: the coefficient of degreePhD is 0. We cannot conclude that there is a difference between the salaries of Masters and PhD faculty once the other variables are accounted for.
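Each t value and p-value in the coefficient table comes from the estimate and its standard error. A short sketch for degreePhD, using the numbers printed in (b); the same steps apply to the other coefficients.

```r
# degreePhD row of summary(salary_1)
est      <- 1388.61
se       <- 1018.75
df_resid <- 45

t_value <- est / se                          # test statistic for H0: coefficient = 0
p_value <- 2 * pt(-abs(t_value), df_resid)   # two-sided p-value

round(t_value, 3)
round(p_value, 3)
p_value < 0.05   # is the coefficient significant at alpha = 0.05?
```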

rankAssoc: 5292.36, p-value .00003

rank is a categorical variable with 3 levels: Asst, Assoc, and Prof, representing the rank of a faculty member. Asst is the base category in this model, so our coefficient of 5292.36 suggests that, with all other variables held constant, there is a $5,292.36 difference in mean salary between Assistant and Associate professors on campus, with Associate professors earning more.
The p-value is far below our default alpha=0.05. We can reject the null hypothesis that the effect is 0 and conclude that there is a difference between the salaries of the groups Asst and Assoc; the variable is significant in predicting salary.

rankProf: 11118.76, p-value: < 0.005

A coefficient of 11118.76 suggests that, with all other variables held constant, there is an $11,118.76 difference in mean salary between Assistant and full Professors on campus, with full Professors earning more.
The p-value is far below our default alpha=0.05. We can reject the null hypothesis that the effect is 0 and conclude that there is a difference between the salaries of the groups Asst and Prof; the variable is significant in predicting salary.

sexFemale: 1166.37, p-value 0.214

sex is a categorical variable with two levels: Male and Female. A coefficient of 1166.37 indicates that, with all other variables held constant, the model estimates a $1,166.37 difference in mean salary between males and females, with females earning more.
The p-value of 0.214 is greater than our alpha value, so sex does not significantly improve the prediction of salary once the other variables are in the model. We cannot conclude that there is a significant difference in earnings between the sexes.

year: 476.31, p-value < .0005

year is a continuous variable describing the number of years a faculty member has spent at their current rank. A coefficient of 476.31 suggests that, with all other variables held constant, a 1-year increase in time at the current rank is associated with a $476.31 increase in predicted salary.
The p-value is far below our default alpha=0.05. We can reject the null hypothesis that the effect is 0 and conclude that the variable is significant in predicting salary.

ysdeg: -124.57, p-value 0.115

ysdeg is a continuous variable describing the number of years since a faculty member earned their highest degree. A coefficient of -124.57 suggests that, with all other variables held constant, a 1-year increase in years since highest degree is associated with a $124.57 decrease in predicted salary.
The p-value of 0.115 is greater than our alpha value, so ysdeg does not significantly improve the prediction of salary once the other variables are in the model. We cannot conclude that there is a significant relationship between salary and ysdeg.

(d)

library(dplyr)  # provides %>% and mutate()

# make Prof the reference level of rank
rank_redo <- salary %>%
  mutate(rank = relevel(rank, ref = 'Prof'))

# refit model
salary_2 <- lm(salary ~.,
               data = rank_redo)
summary(salary_2)
## 
## Call:
## lm(formula = salary ~ ., data = rank_redo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4045.2 -1094.7  -361.5   813.2  9193.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
## degreePhD     1388.61    1018.75   1.363    0.180    
## rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
## rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
## sexFemale     1166.37     925.57   1.260    0.214    
## year           476.31      94.91   5.018 8.65e-06 ***
## ysdeg         -124.57      77.49  -1.608    0.115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2398 on 45 degrees of freedom
## Multiple R-squared:  0.855,  Adjusted R-squared:  0.8357 
## F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

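Changing the reference level of rank from Asst to Prof only reparameterizes the model: the residuals, residual standard error, R-squared, and the coefficients of the other variables are identical to those in (b). The coefficient for rankAsst, -11118.76, is the negative of the rankProf coefficient in (b), so the predicted difference between Assistant and full Professors is unchanged. The coefficient for rankAssoc, -5826.40, now compares Associate professors to full Professors rather than to Assistants, and it matches the difference implied by (b): 5292.36 - 11118.76 = -5826.40. Associate professors are predicted to earn $5,826.40 less than full Professors, with all other variables held constant.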
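The coefficients of the releveled model can be derived from the coefficients in (b) by arithmetic alone, which is a quick way to confirm that the two fits describe the same model. A minimal sketch using the printed estimates; the object names are illustrative.

```r
# Estimates from summary(salary_1), where Asst is the reference level
b_intercept <- 15746.05
b_assoc     <- 5292.36    # Assoc minus Asst
b_prof      <- 11118.76   # Prof minus Asst

# Implied estimates when Prof is the reference level instead
ref_prof <- c(
  intercept = b_intercept + b_prof,  # baseline moves from Asst to Prof
  rankAsst  = -b_prof,               # Asst minus Prof
  rankAssoc = b_assoc - b_prof       # Assoc minus Prof
)
round(ref_prof, 2)
```

These values match the (Intercept), rankAsst, and rankAssoc rows of summary(salary_2).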

(e)

salary_3 <- lm(salary ~ . - rank,
               data = salary)
summary(salary_3)
## 
## Call:
## lm(formula = salary ~ . - rank, data = salary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8146.9 -2186.9  -491.5  2279.1 11186.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
## degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
## sexFemale   -1286.54    1313.09  -0.980 0.332209    
## year          351.97     142.48   2.470 0.017185 *  
## ysdeg         339.40      80.62   4.210 0.000114 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3744 on 47 degrees of freedom
## Multiple R-squared:  0.6312, Adjusted R-squared:  0.5998 
## F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

sex is still not significant in predicting salary, but its coefficient changes sign, as do the coefficients of ysdeg and degree.
Two variables that were not significant when rank was included in the model are now significant at alpha=0.05: ysdeg and degree.
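One way to gauge how much the fit loses when rank is dropped is a partial F-test comparing the models in (b) and (e). This check is not part of the original solution; the sketch below approximates the residual sums of squares from the rounded residual standard errors in the two summaries, so the result is approximate.

```r
# Residual standard errors and degrees of freedom from the two summaries
rse_full <- 2398; df_full <- 45      # model with rank, from (b)
rse_red  <- 3744; df_red  <- 47      # model without rank, from (e)

# Residual sum of squares = RSE^2 * residual df
rss_full <- rse_full^2 * df_full
rss_red  <- rse_red^2  * df_red

# Partial F statistic for the 2 rank coefficients that were dropped
f_stat  <- ((rss_red - rss_full) / (df_red - df_full)) / (rss_full / df_full)
p_value <- pf(f_stat, df_red - df_full, df_full, lower.tail = FALSE)

round(f_stat, 1)
p_value
```

With the fitted models available, anova(salary_3, salary_1) would perform the same comparison without the rounding.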

(f)

dean_edit <- salary %>%
              mutate(dean = 
                       case_when(`ysdeg` > 15 ~ 'Old',
                                 `ysdeg` <= 15 ~ 'New'))
salary_4 <- lm(salary ~ . - rank - ysdeg,
               data = dean_edit)
summary(salary_4)
## 
## Call:
## lm(formula = salary ~ . - rank - ysdeg, data = dean_edit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10740.1  -2550.1     -3.3   1942.4  11718.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18148.9     1188.2  15.274  < 2e-16 ***
## degreePhD    -1186.6     1191.2  -0.996 0.324267    
## sexFemale     -523.5     1355.1  -0.386 0.701017    
## year           531.4      130.2   4.082 0.000172 ***
## deanOld       4449.8     1347.2   3.303 0.001834 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3958 on 47 degrees of freedom
## Multiple R-squared:  0.5878, Adjusted R-squared:  0.5527 
## F-statistic: 16.75 on 4 and 47 DF,  p-value: 1.338e-08

I removed rank and ysdeg because both could be collinear with dean. dean is defined directly from ysdeg (more or less than 15 years since highest degree), so ysdeg carries the same information; and because faculty can progress through the ranks within 15 years, rank is likely to be correlated with dean as well.
The coefficient for deanOld is 4449.8 (p-value 0.002): with degree, sex, and year held constant, the model predicts that those hired by the Old dean have a higher mean salary than those hired by the New dean, contrary to the suggested hypothesis.
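The dean indicator splits faculty at 15 years since highest degree. The same rule can be written in base R with ifelse(); the sketch below applies it to a few hypothetical ysdeg values (not taken from the data) to show which side of the cutoff each falls on.

```r
# Hypothetical years-since-degree values, for illustration only
ysdeg_example <- c(3, 15, 16, 30)

# Same rule as the case_when() call: more than 15 years is 'Old', otherwise 'New'
dean_example <- ifelse(ysdeg_example > 15, "Old", "New")
dean_example

# Number of cases in each group
table(dean_example)
```

A value of exactly 15 is assigned to the New group, matching the `<= 15` condition in the original code.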

(a)

library(smss)  # assumption: house.selling.price is loaded from the smss package
data("house.selling.price")
house <- house.selling.price

house_1 <- lm(Price ~ Size+New, data=house)
summary(house_1)
## 
## Call:
## lm(formula = Price ~ Size + New, data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -205102  -34374   -5778   18929  163866 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
## Size           116.132      8.795  13.204  < 2e-16 ***
## New          57736.283  18653.041   3.095  0.00257 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53880 on 97 degrees of freedom
## Multiple R-squared:  0.7226, Adjusted R-squared:  0.7169 
## F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

Size: 116.132, p-value < .005

Size is a continuous variable with a positive coefficient of 116.132 suggesting that for every one sqft increase, predicted Price will increase by $116.13, New held constant.

New: 57736.28, p-value < .005

New is a categorical variable with two levels, new=1 and old=0. The coefficient of 57736.28 suggests that as a group, our model predicts new houses to have a mean selling price that is $57,736.28 greater than old houses of the same size.

Both variables are significant at alpha=0.05, indicating that they are useful in predicting Price for our data; we can reject the null hypothesis that either coefficient is zero.

(b)

The full prediction equation follows the form $Price = -40230.867 + 116.132*Size + 57736.28*New$

Interpretation of variables, coefficients, and p-values are in (a).

New Homes: $Price = -40230.867 + 116.132*Size + 57736.28$

Old Homes: $Price = -40230.867 + 116.132*Size$
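The two equations can be wrapped in a small function to make predictions by hand. This is a sketch using the rounded coefficients from summary(house_1), so results can differ from predict() by a few dollars; the function name is chosen for illustration.

```r
# Prediction equation from (b), with rounded coefficients from summary(house_1)
predict_price <- function(size, new) {
  -40230.867 + 116.132 * size + 57736.283 * new
}

price_new <- predict_price(3000, new = 1)  # new 3000-sqft home
price_old <- predict_price(3000, new = 0)  # old 3000-sqft home
c(new = price_new, old = price_old)

# Without an interaction, the new-versus-old gap is the New coefficient at every size
price_new - price_old
```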

(c)

new_3000 <- data.frame(`Size`=3000,`New`=1)
new_pred <- predict(house_1, newdata=new_3000)
print(c('New 3000 sqft House:',new_pred))
##                                             1 
## "New 3000 sqft House:"     "365900.183656625"

old_3000 <- data.frame(`Size`=3000,`New`=0)
old_pred <- predict(house_1, newdata=old_3000)
print(c('Old 3000 sqft House:',old_pred))
##                                             1 
## "Old 3000 sqft House:"     "308163.900855831"

(d)

house_2 <- lm(Price ~ Size+New+ Size*New, data=house)
summary(house_2)
## 
## Call:
## lm(formula = Price ~ Size + New + Size * New, data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -175748  -28979   -6260   14693  192519 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -22227.808  15521.110  -1.432  0.15536    
## Size           104.438      9.424  11.082  < 2e-16 ***
## New         -78527.502  51007.642  -1.540  0.12697    
## Size:New        61.916     21.686   2.855  0.00527 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52000 on 96 degrees of freedom
## Multiple R-squared:  0.7443, Adjusted R-squared:  0.7363 
## F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

(e)

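The full prediction equation follows the form $Price = -22227.808 + 104.438*Size - 78527.502*New + 61.916*(Size*New)$

With the interaction in the model, the Size coefficient is the slope for old homes only (New=0), and the New coefficient is the new-versus-old difference at Size=0, which is not a meaningful house size. The interaction coefficient, 61.916 (p-value 0.005), is the additional increase in predicted Price per sqft for new homes compared to old homes.

New Homes: $Price = -22227.808 + 104.438*Size - 78527.502 + 61.916*Size = -100755.310 + 166.354*Size$

Old Homes: $Price = -22227.808 + 104.438*Size$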
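Each group's equation in (e) reduces to a single intercept and slope. A minimal sketch that computes them from the rounded coefficients in summary(house_2); the helper name is illustrative.

```r
# Rounded coefficients from summary(house_2)
b0      <- -22227.808   # intercept
b_size  <- 104.438      # Size
b_new   <- -78527.502   # New
b_inter <- 61.916       # Size:New

# Intercept and slope of the fitted line for old (new = 0) or new (new = 1) homes
line_for <- function(new) {
  c(intercept = b0 + b_new * new, slope = b_size + b_inter * new)
}

line_old <- line_for(0)
line_new <- line_for(1)
rbind(old = line_old, new = line_new)
```

The new-home line starts lower but rises faster per sqft than the old-home line.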

(f)

new_3000_int <- data.frame(`Size`=3000,`New`=1)
new_pred_int <- predict(house_2, newdata=new_3000_int)
print(c('New 3000 sqft House:',new_pred_int))
##                                             1 
## "New 3000 sqft House:"     "398307.512638058"

old_3000_int <- data.frame(`Size`=3000,`New`=0)
old_pred_int <- predict(house_2, newdata=old_3000_int)
print(c('Old 3000 sqft House:',old_pred_int))
##                                             1 
## "Old 3000 sqft House:"     "291087.363770394"

(g)

new_1500_int <- data.frame(`Size`=1500,`New`=1)
new_pred_int2 <- predict(house_2, newdata=new_1500_int)
print(c('New 1500 sqft House:',new_pred_int2))
##                                             1 
## "New 1500 sqft House:"     "148776.101180188"

old_1500_int <- data.frame(`Size`=1500,`New`=0)
old_pred_int2 <- predict(house_2, newdata=old_1500_int)
print(c('Old 1500 sqft House:',old_pred_int2))
##                                             1 
## "Old 1500 sqft House:"     "134429.777919781"

The predicted difference in price between new and old homes increases with size: about $14,346 at 1,500 sqft versus about $107,220 at 3,000 sqft. Because of the interaction term, the effect of New depends on Size.
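In the interaction model the new-versus-old gap is itself a linear function of Size. The sketch below uses the rounded New and Size:New coefficients, so values agree with the predict() output only to within a dollar; the break-even size is an extra illustration that follows from the same equation and is not part of the original answer.

```r
# New-minus-old difference in predicted price at a given size (interaction model)
price_gap <- function(size) {
  -78527.502 + 61.916 * size
}

gap_1500 <- price_gap(1500)
gap_3000 <- price_gap(3000)
c(sqft_1500 = gap_1500, sqft_3000 = gap_3000)

# Size at which the model predicts the same price for new and old homes
breakeven_size <- 78527.502 / 61.916
round(breakeven_size, 1)
```

Below the break-even size the fitted model predicts a lower price for a new home than for an old one, although that comparison is only as reliable as the data available at small sizes.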

(h)

I would select the model with the interaction term, as it has a larger Adjusted R-squared (0.7363 vs. 0.7169) and a smaller Residual Standard Error (52000 vs. 53880), indicating a slightly better fit than the model without the interaction. The interaction term is also significant at alpha=0.05 (p-value 0.005).
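Adjusted R-squared penalizes the extra coefficient in the interaction model, which is why it is a fairer basis for comparison than Multiple R-squared. A short sketch that recomputes both adjusted values from the printed R-squared values; the sample size of 100 is inferred from the residual degrees of freedom (97 plus 3 estimated coefficients).

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 100  # inferred: 97 residual df + 3 coefficients in house_1

adj_main  <- adj_r2(0.7226, n, p = 2)  # Size + New
adj_inter <- adj_r2(0.7443, n, p = 3)  # Size + New + Size:New

round(c(main = adj_main, interaction = adj_inter), 4)
adj_inter > adj_main
```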