library(tidyverse)
library(alr4)
library(smss)
knitr::opts_chunk$set(echo = TRUE)
Karen Detter
August 2, 2022
\(\hat{y}\) (predicted selling price) =
residual = observed - predicted
The home in question sold for $37,704 more than the equation predicted, suggesting that factors not captured by the equation also affect selling price.
For a fixed lot size, the predicted selling price increases by 53.8 for each additional square foot of home size. Holding lot size (\(x_{2}\)) fixed makes its term a constant, so the change in the prediction is \(\Delta\hat{y}\) = 53.8\(\Delta x_{1}\): each one-unit increase in \(x_{1}\) raises the predicted value of y by 53.8.
Lot size (\(x_{2}\)) would need to increase by 18.94 units to have the same impact as a one unit increase in home size (\(x_{1}\)).
Welch Two Sample t-test
data: salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
-567.8539 7247.1471
sample estimates:
mean in group Male mean in group Female
24696.79 21357.14
Although the sample means differ between men and women, the null hypothesis of no difference between the two groups cannot be rejected on the basis of this test alone, because the p-value of .09 exceeds the threshold of .05.
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
2.5 % 97.5 %
sexFemale -697.8183 3030.565
Because the 95% confidence interval includes 0, the null hypothesis that sex has no effect on salary (controlling for the other variables) cannot be rejected.
In this model, the intercept (15746) is the predicted salary when every predictor is at its reference level or zero: a male Assistant professor with a Masters degree, 0 years in current rank, and 0 years since highest degree. Rank and years in current rank show statistically significant effects on salary.
Gaining a level of degree (from Masters to PhD) is associated with a salary increase of 1389, but this effect is not statistically significant (p = 0.18).
Moving from rankAsst to rankAssoc corresponds to an increase in salary of 5292, and moving from rankAsst to rankProf yields a salary increase of 11119.
Each unit of years in current rank corresponds to a salary increase of 476, while each unit of years since highest degree is associated with a decrease in salary of 125, although this association is not statistically significant.
Being female is associated with an increase in salary of 1166, but the relationship is not at the level of statistical significance.
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26864.81 1375.29 19.534 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAsst -11118.76 1351.77 -8.225 1.62e-10 ***
rankAssoc -5826.40 1012.93 -5.752 7.28e-07 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
Having the rank of Associate correlates to a salary of 5826 less than the salary correlated to the rank of Professor, and the rank of Assistant correlates to a salary of 11119 less than that of the Professor rank.
Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 2e-16 ***
degreePhD -3299.35 1302.52 -2.533 0.014704 *
sexFemale -1286.54 1313.09 -0.980 0.332209
year 351.97 142.48 2.470 0.017185 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
In this model, being female is associated with a salary decrease of 1287, but the effect is, again, not statistically significant (p = 0.33).
Call:
lm(formula = salary ~ . - ysdeg, data = hiring_dean)
Residuals:
Min 1Q Median 3Q Max
-3403.3 -1387.0 -167.0 528.2 9233.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26588.79 1168.06 22.763 < 2e-16 ***
degreePhD 818.93 797.48 1.027 0.3100
rankAsst -11096.95 1191.00 -9.317 4.54e-12 ***
rankAssoc -6124.28 1028.58 -5.954 3.65e-07 ***
sexFemale 907.14 840.54 1.079 0.2862
year 434.85 78.89 5.512 1.65e-06 ***
deanprev -2163.46 1072.04 -2.018 0.0496 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared: 0.8594, Adjusted R-squared: 0.8407
F-statistic: 45.86 on 6 and 45 DF, p-value: < 2.2e-16
Because the variable ysdeg would, by nature, be highly correlated to the variable dean, ysdeg was omitted in the model examining the effect of dean on salary.
The resulting model shows a marginally statistically significant effect of hiring dean on salary (p = 0.0496, just under the .05 threshold), with hiring by the previous dean associated with a decrease in salary of 2163.
Call:
lm(formula = Price ~ Size + New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
Size 116.132 8.795 13.204 < 2e-16 ***
New 57736.283 18653.041 3.095 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
Size and whether a house is new each have a statistically significant effect on house price, both with p-values well below the significance threshold of .05.
Each unit of size is associated with an increase in price of 116, and new houses are associated with a price increase of 57736.
prediction equation:
\(\hat{y}\) = -40231 + 116\(x_{1}\) + 57736\(x_{2}\)
where y = house selling price, \(x_{1}\) = house size (in sq ft), \(x_{2}\) = 1 if the house is new and 0 otherwise
alternative prediction equation (house is not new, so \(x_{2}\) = 0):
\(\hat{y}\) = -40231 + 116\(x_{1}\)
where y = house selling price, \(x_{1}\) = house size (in sq ft)
(i)
3000 sq ft, new house:
\(\hat{y}\) = -40231 + 116(3000) + 57736 = -40231 + 348000 + 57736 = -40231 + 405736 = $365,505
(ii)
3000 sq ft, not new house:
\(\hat{y}\) = -40231 + 116(3000) + 0 = -40231 + 348000 = $307,769
Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 2e-16 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
The prediction equations generated from this model are:
(i) New
Price = -22228 + 104(Size) - 78528 + 62(Size)
(ii) Not New
Price = -22228 + 104(Size)
Therefore, new houses are associated with an additional price increase of 62 per unit of size increase.
Predicted Prices:
(i) 3000 sq ft New House
Price = -22228 + 104(3000) - 78528 + 62(3000) = $397,244
(ii) 3000 sq ft Not New House
Price = -22228 + 104(3000) = $289,772
Predicted Prices:
(i) 1500 sq ft New House
Price = -22228 + 104(1500) - 78528 + 62(1500) = $148,244
(ii) 1500 sq ft Not New House
Price = -22228 + 104(1500) = $133,772
The ratio between the predicted selling prices of the new and not new, 1500 sq ft house is 1.108. The ratio between the predicted selling prices of the new and not new, 3000 sq ft house is 1.371.
As size increases, the price difference between new and not new houses also increases, indicating that there is an interaction between Size and New.
The model with the interaction term between Size and New seems to better represent the relationship between these variables and Price. Also, the Adjusted \(R^{2}\) for this model is a bit higher, indicating that it explains a higher portion of the variance than the original model.
---
title: "HW 4"
author: "Karen Detter"
description: "HW 4 - Modeling"
date: "08/02/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw4
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(alr4)
library(smss)
knitr::opts_chunk$set(echo = TRUE)
```
## Question 1
### A.
$\hat{y}$ (predicted selling price) =
```{r}
yhat <- (-10536 + (53.8*1240) + (2.84*18000))
yhat
```
residual = observed - predicted
```{r}
res <- 145000 - 107296
res
```
The home in question sold for $37,704 more than the equation predicted, suggesting that factors not captured by the equation also affect selling price.
### B.
For a fixed lot size, the predicted selling price increases by 53.8 for each additional square foot of home size. Holding lot size ($x_{2}$) fixed makes its term a constant, so the change in the prediction is $\Delta\hat{y}$ = 53.8$\Delta x_{1}$: each one-unit increase in $x_{1}$ raises the predicted value of y by 53.8.
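As a quick numerical check of this interpretation, the sketch below evaluates the prediction equation from part A at two home sizes that differ by one square foot while the lot size stays the same. The helper name `predict_price` is introduced here only for illustration.

```r
# Prediction equation from part A (coefficients as given in the question)
predict_price <- function(home_size, lot_size) {
  -10536 + 53.8 * home_size + 2.84 * lot_size
}

# Same lot size, home sizes one square foot apart
delta <- predict_price(1241, 18000) - predict_price(1240, 18000)
delta  # approximately 53.8, because the lot-size term cancels
```

The intercept and the lot-size term are identical in both predictions, so only the home-size coefficient survives in the difference.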
### C.
```{r}
incr <- 53.8 / 2.84
incr
```
Lot size ($x_{2}$) would need to increase by 18.94 units to have the same impact as a one unit increase in home size ($x_{1}$).
## Question 2
### A.
```{r}
#test hypothesis with two sample t-test
data("salary")
t.test(salary ~ sex, data = salary)
```
Although the sample means differ between men and women, the null hypothesis of no difference between the two groups cannot be rejected on the basis of this test alone, because the p-value of .09 exceeds the threshold of .05.
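The reported p-value can be recovered, approximately, from the test statistic and the Welch degrees of freedom printed by `t.test()`. This is only a sketch that reuses the rounded values from the output; the variable names are illustrative.

```r
# Values copied (rounded) from the Welch two-sample t-test output
t_stat   <- 1.7744
df_welch <- 21.591

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df_welch)
p_value

# Decision at the .05 level: TRUE would mean "reject the null hypothesis"
p_value < 0.05
```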
### B.
```{r}
#run a multiple linear regression with all variables explaining salary
model <- lm(salary ~ ., data = salary)
summary(model)
```
```{r}
#obtain a 95% confidence interval for difference in salary by sex
confint(model, 'sexFemale')
```
Because the 95% confidence interval includes 0, the null hypothesis that sex has no effect on salary (controlling for the other variables) cannot be rejected.
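For reference, the interval returned by `confint()` can be reconstructed by hand from the coefficient table as estimate ± t-quantile × standard error. The sketch below uses the rounded `sexFemale` estimate and standard error from the model summary, so it matches the reported interval only up to rounding.

```r
# sexFemale row of the coefficient table, plus residual degrees of freedom
estimate  <- 1166.37
std_error <- 925.57
df_resid  <- 45

# 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
ci

# An interval that spans 0 means the effect is not significant at the .05 level
contains_zero <- ci[1] < 0 && ci[2] > 0
contains_zero
```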
### C.
In this model, the **intercept** (15746) is the predicted salary when every predictor is at its reference level or zero: a male Assistant professor with a Masters degree, 0 years in current rank, and 0 years since highest degree. **Rank** and **years in current rank** show statistically significant effects on **salary**.
Gaining a level of **degree** (from Masters to PhD) is associated with a salary increase of 1389, but this effect is not statistically significant (p = 0.18).
Moving from **rankAsst** to **rankAssoc** corresponds to an increase in salary of 5292, and moving from **rankAsst** to **rankProf** yields a salary increase of 11119.
Each unit of **years in current rank** corresponds to a salary increase of 476, while each unit of **years since highest degree** is associated with a *decrease* in salary of 125, although this association is not statistically significant.
Being **female** is associated with an increase in salary of 1166, but the relationship is not at the level of statistical significance.
### D.
```{r}
#change baseline for rank category
salary$rank <- relevel(salary$rank, ref = 'Prof')
summary(lm(salary ~ ., data = salary))
```
Having the rank of Associate correlates to a salary of 5826 less than the salary correlated to the rank of Professor, and the rank of Assistant correlates to a salary of 11119 less than that of the Professor rank.
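Changing the reference level does not change the fitted model, only how the same rank differences are expressed. As a rough consistency check, the releveled coefficients can be derived from the coefficients of the model that used Assistant as the baseline; the variable names below are illustrative.

```r
# Coefficients from the model with Asst as the reference rank (part B)
b_asst_ref <- c(intercept = 15746.05, rankAssoc = 5292.36, rankProf = 11118.76)

# Implied values once Prof becomes the reference rank
intercept_prof_ref <- b_asst_ref[["intercept"]] + b_asst_ref[["rankProf"]]
assoc_vs_prof      <- b_asst_ref[["rankAssoc"]] - b_asst_ref[["rankProf"]]
asst_vs_prof       <- -b_asst_ref[["rankProf"]]

c(intercept = intercept_prof_ref,
  rankAsst  = asst_vs_prof,
  rankAssoc = assoc_vs_prof)
```

These values agree with the releveled summary (26864.81, -11118.76, and -5826.40), which is why the residuals, R-squared, and the remaining coefficients are identical in both fits.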
### E.
```{r}
#refit model excluding the rank variable
model_alt <- lm(salary ~ degree + sex + year + ysdeg, data = salary)
summary(model_alt)
```
In this model, being female is associated with a salary decrease of 1287, but the effect is, again, not statistically significant (p = 0.33).
### F.
```{r}
#create new variable for hiring dean
hiring_dean <- salary %>%
  mutate(dean = case_when(ysdeg > 15 ~ 'prev',
                          ysdeg <= 15 ~ 'new'))
#fit new model to test hypothesis while avoiding multicollinearity
dean_model <- lm(salary ~ . - ysdeg, data = hiring_dean)
summary(dean_model)
```
Because the variable *ysdeg* would, by nature, be highly correlated to the variable *dean*, *ysdeg* was omitted in the model examining the effect of *dean* on salary.
The resulting model shows a marginally statistically significant effect of hiring dean on salary (p = 0.0496, just under the .05 threshold), with hiring by the previous dean associated with a decrease in salary of 2163.
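To see how close this result is to the .05 cutoff, the t value and p-value for the `deanprev` coefficient can be recomputed from its estimate and standard error. This sketch uses the rounded numbers from the summary table, and the variable names are illustrative.

```r
# deanprev row of the coefficient table, plus residual degrees of freedom
estimate  <- -2163.46
std_error <- 1072.04
df_resid  <- 45

# t statistic and two-sided p-value
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df_resid)

c(t_value = t_value, p_value = p_value)
```

The p-value lands only slightly below .05, so the conclusion is sensitive to small changes in the model specification.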
## Question 3
### A.
```{r}
data("house.selling.price")
summary(lm(Price ~ Size + New, data = house.selling.price))
```
Size and whether a house is new each have a statistically significant effect on house price, both with p-values well below the significance threshold of .05.
Each unit of size is associated with an increase in price of 116, and new houses are associated with a price increase of 57736.
### B.
**prediction equation:**
$\hat{y}$ = -40231 + 116$x_{1}$ + 57736$x_{2}$
**where** y = house selling price, $x_{1}$ = house size (in sq ft), $x_{2}$ = 1 if the house is new and 0 otherwise
**alternative prediction equation** (house is not new, so $x_{2}$ = 0):
$\hat{y}$ = -40231 + 116$x_{1}$
**where** y = house selling price, $x_{1}$ = house size (in sq ft)
### C.
*(i)*
**3000 sq ft, new house:**
$\hat{y}$ = -40231 + 116(3000) + 57736
= -40231 + 348000 + 57736
= -40231 + 405736
= $365,505
*(ii)*
**3000 sq ft, not new house:**
$\hat{y}$ = -40231 + 116(3000) + 0
= -40231 + 348000
= $307,769
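The same arithmetic can be expressed as a small function, treating *New* as a 0/1 indicator. This is a sketch using the rounded coefficients from the prediction equation in part B; `predict_main` is a name chosen here for illustration.

```r
# Main-effects prediction equation with rounded coefficients
# new = 1 for a new house, 0 otherwise
predict_main <- function(size, new) {
  -40231 + 116 * size + 57736 * new
}

price_new     <- predict_main(3000, new = 1)
price_not_new <- predict_main(3000, new = 0)

c(new = price_new, not_new = price_not_new)

# In this model the gap is the New coefficient, whatever the size
price_new - price_not_new
```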
### D.
```{r}
#fit model with an interaction term between variables
summary(lm(Price ~ Size + New + Size*New, data = house.selling.price))
```
### E.
The prediction equations generated from this model are:
**(i) New**
Price = -22228 + 104(Size) - 78528 + 62(Size)
**(ii) Not New**
Price = -22228 + 104(Size)
Therefore, new houses are associated with an additional price increase of 62 per unit of size increase.
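Collecting terms in the equation for new houses gives a single intercept and a single slope, which makes the comparison with not-new houses easier to read. The sketch below does this with the rounded coefficients used above; the variable names are illustrative.

```r
# Rounded coefficients from the interaction model
b0     <- -22228   # intercept
b_size <- 104      # Size
b_new  <- -78528   # New
b_int  <- 62       # Size:New

# Equation for new houses: (b0 + b_new) + (b_size + b_int) * Size
new_intercept <- b0 + b_new
new_slope     <- b_size + b_int

c(intercept = new_intercept, slope = new_slope)
```

So for new houses the equation reduces to Price = -100756 + 166(Size), compared with a slope of 104 for houses that are not new.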
### F.
*Predicted Prices:*
**(i) 3000 sq ft New House**
Price = -22228 + 104(3000) - 78528 + 62(3000)
= $397,244
**(ii) 3000 sq ft Not New House**
Price = -22228 + 104(3000)
= $289,772
### G.
*Predicted Prices:*
**(i) 1500 sq ft New House**
Price = -22228 + 104(1500) - 78528 + 62(1500)
= $148,244
**(ii) 1500 sq ft Not New House**
Price = -22228 + 104(1500)
= $133,772
The ratio between the predicted selling prices of the new and not new, 1500 sq ft house is 1.108.
The ratio between the predicted selling prices of the new and not new, 3000 sq ft house is 1.371.
As size increases, the price difference between new and not new houses also increases, indicating that there *is* an interaction between *Size* and *New*.
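The predictions, differences, and ratios for both sizes can be reproduced in one step. This is a sketch using the rounded interaction-model coefficients; `predict_interaction` is a name introduced here for illustration.

```r
# Interaction-model prediction with rounded coefficients
# new = 1 for a new house, 0 otherwise
predict_interaction <- function(size, new) {
  -22228 + 104 * size + new * (-78528 + 62 * size)
}

sizes     <- c(1500, 3000)
price_new <- predict_interaction(sizes, new = 1)
price_old <- predict_interaction(sizes, new = 0)

comparison <- data.frame(
  size       = sizes,
  new        = price_new,
  not_new    = price_old,
  difference = price_new - price_old,
  ratio      = round(price_new / price_old, 3)
)
comparison
```

The difference column grows with size, which is the pattern the interaction term is meant to capture.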
### H.
The model with the interaction term between *Size* and *New* seems to better represent the relationship between these variables and *Price*. Also, the Adjusted $R^{2}$ for this model is a bit higher, indicating that it explains a higher portion of the variance than the original model.
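Adjusted $R^{2}$ penalizes the extra interaction term, so it is the fairer comparison between the two models. As a rough check, the reported values can be recovered from the multiple $R^{2}$ values with the usual formula, assuming n = 100 observations (implied by the residual degrees of freedom in both summaries). The helper `adj_r2` is illustrative.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj_main        <- adj_r2(0.7226, n = 100, p = 2)  # Size + New
adj_interaction <- adj_r2(0.7443, n = 100, p = 3)  # Size + New + Size:New

round(c(main = adj_main, interaction = adj_interaction), 4)
```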