Question 1
First, let's load the relevant libraries and set all the graph themes to minimal.
Code
library(readxl)
library(tidyverse)
library(dplyr)
library(alr4)
library(smss)
theme_minimal()
A
The prediction equation for the following three subsections is: ŷ = −10,536 + 53.8x1 + 2.84x2, where x1 is house size and x2 is lot size (both in square feet).
Code
x <- sum(-10536 + (53.8*1240) + (2.84*18000))
y <- 145000
y - x
[1] 37704
The predicted sale price is $107,296. That is a difference (residual) of $37,704.
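The same arithmetic can be wrapped in a small helper function; `predict_price` is a name introduced here for illustration, not part of any fitted model object.

```r
# Prediction equation from the text: y-hat = -10536 + 53.8*x1 + 2.84*x2,
# where x1 = house size (sq ft) and x2 = lot size (sq ft).
# predict_price is a hypothetical helper, not from any package.
predict_price <- function(house_sqft, lot_sqft) {
  -10536 + 53.8 * house_sqft + 2.84 * lot_sqft
}

predicted <- predict_price(1240, 18000)  # 107296
residual <- 145000 - predicted           # observed minus predicted: 37704
c(predicted = predicted, residual = residual)
```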
B
Code
x <- sum(-10536 + (53.8*500) + (2.84*1000))
y <- sum(-10536 + (53.8*501) + (2.84*1000))
y - x
[1] 53.8
For a fixed lot size, the house selling price is predicted to increase by $53.80 for each one-square-foot increase in house size, because house size is multiplied by $53.80 in the prediction equation.
C
Code
sum(53.8/2.84)
[1] 18.94366
Lot size would have to increase by almost 19 square feet to have the same impact as a one-square-foot increase in home size.
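That ratio can be sanity-checked directly: adding 53.8/2.84 square feet of lot contributes the same amount to the prediction as one square foot of house size.

```r
# Each extra sq ft of house size adds $53.80; each extra sq ft of lot adds $2.84.
lot_equiv <- 53.8 / 2.84          # about 18.94 sq ft of lot per sq ft of house
contribution <- 2.84 * lot_equiv  # predicted-price contribution of that much lot
contribution                       # ~53.8, matching one sq ft of house size
```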
Question 2
Code
data(salary)
head(salary)
degree rank sex year ysdeg salary
1 Masters Prof Male 25 35 36350
2 Masters Prof Male 13 22 35350
3 Masters Prof Male 10 23 28200
4 Masters Prof Female 7 27 26775
5 PhD Prof Male 19 30 33696
6 Masters Prof Male 16 21 28516
A
Code
t.test(salary ~ sex, salary)
Welch Two Sample t-test
data: salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
-567.8539 7247.1471
sample estimates:
mean in group Male mean in group Female
24696.79 21357.14
While the sample means differ by sex, the p-value (0.090) does not meet the 0.05 threshold for statistical significance, so we cannot reject the null hypothesis of equal mean salaries.
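The same two-sided Welch test can be run on any numeric response split by a two-level factor; here is a minimal self-contained sketch with made-up pay numbers (not the smss salary data).

```r
# Made-up pay data for two groups; the analysis above uses salary ~ sex instead.
toy <- data.frame(
  pay = c(30, 32, 35, 28, 31, 25, 27, 26, 29, 24),
  group = rep(c("A", "B"), each = 5)
)
res <- t.test(pay ~ group, data = toy)  # Welch test is the default (var.equal = FALSE)
res$p.value                              # compare against the 0.05 threshold
```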
B
Code
salary$sex <- relevel(salary$sex, ref = 1)
model1 <- lm(salary ~ sex + rank + degree + year + ysdeg, data = salary)
confint(model1, level = 0.95)
With male as the reference level, the 95% CI for the effect of being female on salary is -697.82 to 3030.56 (equivalently, -3030.56 to 697.82 for being male).
C
Code
summary(model1)
Call:
lm(formula = salary ~ sex + rank + degree + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
sexFemale 1166.37 925.57 1.260 0.214
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
degreePhD 1388.61 1018.75 1.363 0.180
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
Three variables reach statistical significance (low p-values): rankAssoc, rankProf, and year. sexFemale does not reach statistical significance.
year has the smallest estimate and standard error, roughly 5x to 10x smaller than those of the rank terms; its t-value (5.02) is lower than rankProf's but still the second highest.
D
Code
salary$rank <- relevel(salary$rank, ref = 3)
model_relevel_rank <- lm(salary ~ rank + degree + sex + year + ysdeg, data = salary)
summary(model_relevel_rank)
Call:
lm(formula = salary ~ rank + degree + sex + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28031.18 1677.06 16.715 < 2e-16 ***
rankAsst -11118.76 1351.77 -8.225 1.62e-10 ***
rankAssoc -5826.40 1012.93 -5.752 7.28e-07 ***
degreePhD 1388.61 1018.75 1.363 0.180
sexMale -1166.37 925.57 -1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
Releveling makes Prof the reference category, so the model now reports Asst and Assoc relative to Prof. Both coefficients are negative and significant at the 0.001 level: assistant professors earn about $11,119 less, and associate professors about $5,826 less, than full professors, holding the other predictors fixed. In this parameterization the Assoc coefficient is also estimated slightly more precisely (SE 1012.93 vs 1145.40 before).
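Releveling only changes which category is the baseline; the fitted model is the same. A self-contained check using the built-in mtcars data:

```r
# Two parameterizations of the same model: default reference vs. cyl = 8 as reference.
m_default <- lm(mpg ~ factor(cyl), data = mtcars)
m_relevel <- lm(mpg ~ relevel(factor(cyl), ref = "8"), data = mtcars)
# The coefficients differ, but every fitted value is identical.
all.equal(unname(fitted(m_default)), unname(fitted(m_relevel)))
```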
E
Code
model2 <- lm(salary ~ degree + sex + year + ysdeg, data = salary)
summary(model2)
Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15897.03 1259.87 12.618 < 2e-16 ***
degreePhD -3299.35 1302.52 -2.533 0.014704 *
sexMale 1286.54 1313.09 0.980 0.332209
year 351.97 142.48 2.470 0.017185 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
Excluding rank, ysdeg becomes the strongest predictor: it has the smallest standard error among the slopes and is the only predictor that meets the 0.001 significance threshold.
F
Code
salary$dean <- ifelse(salary$ysdeg >= 15, "Old Dean", "New Dean")
table(salary$dean)
model3 <- lm(salary ~ dean, data = salary)
summary(model3)
Call:
lm(formula = salary ~ dean, data = salary)
Residuals:
Min 1Q Median 3Q Max
-9311.1 -4185.9 -573.6 3931.8 13383.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20580 1727 11.913 3.23e-16 ***
deanOld Dean 4082 1945 2.098 0.041 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5729 on 50 degrees of freedom
Multiple R-squared: 0.08091, Adjusted R-squared: 0.06253
F-statistic: 4.402 on 1 and 50 DF, p-value: 0.04097
Those hired before the new dean arrived ("Old Dean") do earn significantly more, though only at the 0.05 level. Interestingly, the relationship is positive, which runs counter to the claim that the new dean has been making more generous offers to new hires. However, we should add some control variables while being careful to avoid multicollinearity: rank, ysdeg, and year would all introduce multicollinearity here, since dean is derived directly from ysdeg.
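A toy sketch with made-up years-since-degree values (not the salary data) shows why: a dummy variable cut from a numeric predictor is strongly correlated with it by construction.

```r
# Made-up years-since-degree values; dean_dummy is derived from them with the
# same >= 15 cutoff used in the text, coded numerically for the correlation.
ys <- c(2, 5, 9, 14, 16, 20, 25, 30)
dean_dummy <- ifelse(ys >= 15, 1, 0)
cor(ys, dean_dummy)  # strong positive correlation, hence multicollinearity risk
```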
Code
model4 <- lm(salary ~ dean + degree + sex, data = salary)
summary(model4)
Call:
lm(formula = salary ~ dean + degree + sex, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8919.1 -4457.4 -344.1 3386.5 15921.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18088 2159 8.379 5.92e-11 ***
deanOld Dean 4035 1921 2.100 0.0410 *
degreePhD 345 1654 0.209 0.8357
sexMale 3296 1768 1.864 0.0685 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5640 on 48 degrees of freedom
Multiple R-squared: 0.145, Adjusted R-squared: 0.09157
F-statistic: 2.714 on 3 and 48 DF, p-value: 0.05514
The standard error drops, but everything else remains very similar.
Question 3
A
Code
data(house.selling.price)
model5 <- lm(Price ~ Size + New, house.selling.price)
summary(model5)
Call:
lm(formula = Price ~ Size + New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
Size 116.132 8.795 13.204 < 2e-16 ***
New 57736.283 18653.041 3.095 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
Size is statistically significant at the 0.001 level, while New is significant at the 0.01 level. Size has a small standard error and a large t-value; New has a larger estimate and standard error and a smaller t-value. We could relevel New to make old houses the baseline, but that is covered by the later parts of this question.
B
The prediction equations are:
New (New = 1): Predicted Price = -40,230 + 116.13(Size) + 57,736
Old (New = 0): Predicted Price = -40,230 + 116.13(Size)
C
Code
sum(-40230 + (116.13*3000) + 57736)
sum(-40230 + (116.13*3000))
[1] 365896
[1] 308160
For a new 3,000-square-foot house, the predicted selling price is $365,896; for an old one it is $308,160.
D
Code
model6 <- lm(Price ~ Taxes + Size + New, house.selling.price)
summary(model6)
E
With taxes added as a control, the prediction equations become:
New (New = 1): Predicted Price = -21,353 + 37.23(Taxes) + 61.74(Size) + 46,373.70
Old (New = 0): Predicted Price = -21,353 + 37.23(Taxes) + 61.74(Size)
F
Code
sum(-21353 + (37.23*1) + (61.74*3000) + 46373.70)
sum(-21353 + (37.23*1) + (61.74*3000))
[1] 210277.9
[1] 163904.2
Holding Taxes fixed, the predicted selling price for a new 3,000-square-foot house is $210,277.93 and for an old one is $163,904.23.
G
Code
sum(-21353 + (37.23*1) + (61.74*1500) + 46373.70)
sum(-21353 + (37.23*1) + (61.74*1500))
[1] 117667.9
[1] 71294.23
The predicted selling price for a new 1,500-square-foot house is $117,667.93 and for an old one is $71,294.23. Each additional square foot adds $61.74 to the predicted price.
H
I prefer the model that includes taxes. It reduces both the residuals and the standard errors of the original variables, Size and New. With smaller standard errors, we should get a more accurate prediction of a house's potential sale price, and the adjusted R-squared is also larger. For those reasons, I would include Taxes in the model.
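A more formal way to choose between two nested models is an F-test or an AIC comparison; a self-contained sketch on the built-in mtcars data (not house.selling.price):

```r
# Smaller model vs. larger nested model, analogous to Price ~ Size + New
# vs. Price ~ Taxes + Size + New in the text.
m_small <- lm(mpg ~ wt, data = mtcars)
m_big <- lm(mpg ~ wt + hp, data = mtcars)
anova(m_small, m_big)  # F-test: does the added predictor significantly improve fit?
AIC(m_small, m_big)    # lower AIC indicates the preferred model
```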
Source Code
---
title: "Homework 4"
author: "Caleb Hill"
description: "The fourth homework on regressions"
date: "11/12/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw4
  - regression
  - prediction
---

# Question 1

First, let's load the relevant libraries and set all the graph themes to minimal.

```{r, results='hide'}
library(readxl)
library(tidyverse)
library(dplyr)
library(alr4)
library(smss)
theme_minimal()
```

## A

The prediction equation for the following three subsections is: ŷ = −10,536 + 53.8x1 + 2.84x2, where x1 is house size and x2 is lot size (both in square feet).

```{r}
x <- sum(-10536 + (53.8*1240) + (2.84*18000))
y <- 145000
y - x
```

The predicted sale price is \$107,296. That is a difference (residual) of \$37,704.

## B

```{r}
x <- sum(-10536 + (53.8*500) + (2.84*1000))
y <- sum(-10536 + (53.8*501) + (2.84*1000))
y - x
```

For a fixed lot size, the house selling price is predicted to increase by \$53.80 for each one-square-foot increase in house size, because house size is multiplied by \$53.80 in the prediction equation.

## C

```{r}
sum(53.8/2.84)
```

Lot size would have to increase by almost 19 square feet to have the same impact as a one-square-foot increase in home size.

# Question 2

```{r}
data(salary)
head(salary)
```

## A

```{r}
t.test(salary ~ sex, salary)
```

While the sample means differ by `sex`, the p-value (0.090) does not meet the 0.05 threshold for statistical significance, so we cannot reject the null hypothesis of equal mean salaries.

## B

```{r}
salary$sex <- relevel(salary$sex, ref = 1)
model1 <- lm(salary ~ sex + rank + degree + year + ysdeg, data = salary)
confint(model1, level = 0.95)
```

The 95% CI for the effect of being female on salary is −697.82 to 3030.56.

```{r}
salary$sex <- relevel(salary$sex, ref = 2)
model_relevel_sex <- lm(salary ~ sex + rank + degree + year + ysdeg, data = salary)
confint(model_relevel_sex, level = 0.95)
```

The 95% CI for the effect of being male on salary is −3030.56 to 697.82.

## C

```{r}
summary(model1)
```

Three variables reach statistical significance (low p-values): `rankAssoc`, `rankProf`, and `year`. `sexFemale` does not reach statistical significance. `year` has the smallest estimate and standard error, roughly 5x to 10x smaller than those of the rank terms; its t-value is lower than `rankProf`'s but still the second highest.

## D

```{r}
salary$rank <- relevel(salary$rank, ref = 3)
model_relevel_rank <- lm(salary ~ rank + degree + sex + year + ysdeg, data = salary)
summary(model_relevel_rank)
```

Releveling makes Prof the reference category, so the model now reports Asst and Assoc relative to Prof. Both coefficients are negative and significant at the 0.001 level: assistant and associate professors earn significantly less than full professors, holding the other predictors fixed.

## E

```{r}
model2 <- lm(salary ~ degree + sex + year + ysdeg, data = salary)
summary(model2)
```

Excluding rank, `ysdeg` becomes the strongest predictor: it has the smallest standard error among the slopes and is the only predictor that meets the 0.001 significance threshold.

## F

```{r}
salary$dean <- ifelse(salary$ysdeg >= 15, "Old Dean", "New Dean")
table(salary$dean)
model3 <- lm(salary ~ dean, data = salary)
summary(model3)
```

Those hired before the new dean arrived do earn significantly more, though only at the 0.05 level. Interestingly, the relationship is positive, which runs counter to the claim that the new dean has been making more generous offers to new hires. However, we should add some control variables while being careful to avoid multicollinearity: `rank`, `ysdeg`, and `year` would all introduce multicollinearity here, since `dean` is derived directly from `ysdeg`.

```{r}
model4 <- lm(salary ~ dean + degree + sex, data = salary)
summary(model4)
```

The standard error drops, but everything else remains very similar.

# Question 3

## A

```{r}
data(house.selling.price)
model5 <- lm(Price ~ Size + New, house.selling.price)
summary(model5)
```

`Size` is statistically significant at the 0.001 level, while `New` is significant at the 0.01 level. `Size` has a small standard error and a large t-value; `New` has a larger estimate and standard error and a smaller t-value.

```{r}
house.selling.price$New <- relevel(factor(house.selling.price$New), ref = 1)
model_relevel_new <- lm(Price ~ Size + New, house.selling.price)
summary(model_relevel_new)
```

## B

The prediction equations are:

New (New = 1): Predicted Price = −40,230 + 116.13(Size) + 57,736

Old (New = 0): Predicted Price = −40,230 + 116.13(Size)

## C

```{r}
sum(-40230 + (116.13*3000) + 57736)
sum(-40230 + (116.13*3000))
```

For a new 3,000-square-foot house, the predicted selling price is \$365,896; for an old one it is \$308,160.

## D

```{r}
model6 <- lm(Price ~ Taxes + Size + New, house.selling.price)
summary(model6)
```

## E

With taxes added as a control, the prediction equations become:

New (New = 1): Predicted Price = −21,353 + 37.23(Taxes) + 61.74(Size) + 46,373.70

Old (New = 0): Predicted Price = −21,353 + 37.23(Taxes) + 61.74(Size)

## F

```{r}
sum(-21353 + (37.23*1) + (61.74*3000) + 46373.70)
sum(-21353 + (37.23*1) + (61.74*3000))
```

Holding Taxes fixed, the predicted selling price for a new 3,000-square-foot house is \$210,277.93 and for an old one is \$163,904.23.

## G

```{r}
sum(-21353 + (37.23*1) + (61.74*1500) + 46373.70)
sum(-21353 + (37.23*1) + (61.74*1500))
```

The predicted selling price for a new 1,500-square-foot house is \$117,667.93 and for an old one is \$71,294.23. Each additional square foot adds \$61.74 to the predicted price.

## H

I prefer the model that includes taxes. It reduces both the residuals and the standard errors of the original variables, `Size` and `New`. With smaller standard errors, we should get a more accurate prediction of a house's potential sale price, and the adjusted R-squared is also larger. For those reasons, I would include `Taxes` in the model.