hw4
regression
Author

Donny Snyder

Published

November 14, 2022

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
Warning: package 'smss' was built under R version 4.2.2
Code
library(dplyr)
library(ggplot2)

Question 1

Code
predictSell <- -10536 + (53.8 * 1240) + (2.84 * 18000)
realSell <- 145000
residual <- realSell - predictSell
ratioLottoHome <- 53.8/2.84

#Question 1A The predicted selling price is 107,296 dollars and the residual is 37,704 dollars, so the model underpredicted the actual selling price of 145,000 dollars.

#Question 1B With lot size held fixed, the predicted selling price increases by 53.8 dollars for each additional square foot of home size. A square foot of home adds more to the predicted price than a square foot of lot (2.84 dollars).

#Question 1C The lot size would have to increase by about 18.94 square feet (53.8/2.84) to have the same impact as a one-square-foot increase in home size.
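
The 1C ratio can be checked numerically. This is a minimal sketch using only the coefficients from the prediction equation above; the helper name `predict_price` is hypothetical and not part of the original code.

```r
# Coefficients from the Question 1 prediction equation
b0 <- -10536
b_home <- 53.8
b_lot <- 2.84

# Hypothetical helper: predicted selling price for home and lot sizes in square feet
predict_price <- function(home, lot) b0 + b_home * home + b_lot * lot

base_price <- predict_price(1240, 18000)

# Price gain from one extra square foot of home
gain_home <- predict_price(1241, 18000) - base_price

# Price gain from b_home / b_lot (about 18.94) extra square feet of lot
gain_lot <- predict_price(1240, 18000 + b_home / b_lot) - base_price
```

Both gains come out to the same amount, which is what the ratio in 1C expresses.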

Question 2

Code
data <- salary

model1 <- lm(salary~sex, data = data)
summary(model1)

Call:
lm(formula = salary ~ sex, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8602.8 -4296.6  -100.8  3513.1 16687.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    24697        938  26.330   <2e-16 ***
sexFemale      -3340       1808  -1.847   0.0706 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared:  0.0639,    Adjusted R-squared:  0.04518 
F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706
Code
model2 <- lm(salary~degree+rank+sex+year+ysdeg, data = data)
summary(model2)

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
Code
confint(model2)
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
sexFemale    -697.8183  3030.56452
year          285.1433   667.47476
ysdeg        -280.6397    31.49105
Code
data$rank <- factor(data$rank, levels = c("Prof", "Assoc", "Asst"))

model3 <- lm(salary~degree+rank+sex+year+ysdeg, data = data)
summary(model3)

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
degreePhD     1388.61    1018.75   1.363    0.180    
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
sexFemale     1166.37     925.57   1.260    0.214    
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
Code
model4 <- lm(salary~degree+sex+year+ysdeg, data = data)
summary(model4)

Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09
Code
# dean = 1 for faculty with 15 or fewer years since their highest degree, 0 otherwise
data$dean <- ifelse(data$ysdeg > 15, 0, 1)

model5 <- lm(data = data, salary~sex+year+dean+degree)
summary(model5)

Call:
lm(formula = salary ~ sex + year + dean + degree, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-10740.1  -2550.1     -3.3   1942.4  11718.3 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  22598.7     1792.1  12.610  < 2e-16 ***
sexFemale     -523.5     1355.1  -0.386 0.701017    
year           531.4      130.2   4.082 0.000172 ***
dean         -4449.8     1347.2  -3.303 0.001834 ** 
degreePhD    -1186.6     1191.2  -0.996 0.324267    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3958 on 47 degrees of freedom
Multiple R-squared:  0.5878,    Adjusted R-squared:  0.5527 
F-statistic: 16.75 on 4 and 47 DF,  p-value: 1.338e-08

Question 2A

In model1, the estimated mean salary for women is 3,340 dollars lower than for men, but the difference is not statistically significant at the 0.05 level (p = 0.0706). The null hypothesis of no difference in mean salary therefore cannot be rejected.
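
A regression on a single two-level factor is equivalent to a pooled-variance two-sample t-test, so the p-value in model1 is also the p-value for a difference in means. The sketch below illustrates this on simulated stand-in data; the group sizes, means, and standard deviation are assumptions chosen for illustration and are not the alr4 salary data.

```r
set.seed(42)

# Simulated stand-in data (assumed values, not the real salary data)
sim <- data.frame(
  sex = factor(rep(c("Male", "Female"), times = c(38, 14)),
               levels = c("Male", "Female")),
  salary = c(rnorm(38, mean = 24700, sd = 5800),
             rnorm(14, mean = 21400, sd = 5800))
)

# p-value for the sex coefficient in a simple regression
fit <- lm(salary ~ sex, data = sim)
p_lm <- summary(fit)$coefficients["sexFemale", "Pr(>|t|)"]

# p-value from a two-sample t-test that assumes equal variances
p_t <- t.test(salary ~ sex, data = sim, var.equal = TRUE)$p.value
```

The two p-values agree up to floating-point error.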

Question 2B

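The 95% confidence interval for the salary difference between females and males (female minus male), controlling for degree, rank, year, and ysdeg in model2, runs from -697.82 to 3030.56 dollars. The interval contains zero, so the difference is not statistically significant.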
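
The interval can be reproduced by hand from the model2 summary as estimate ± t critical value × standard error. This sketch plugs in the rounded sexFemale estimate, its standard error, and the 45 residual degrees of freedom printed above, so it matches `confint(model2)` only up to rounding.

```r
# Values copied from the sexFemale row of summary(model2)
est <- 1166.37
se <- 925.57
df_resid <- 45

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df = df_resid)

# Lower and upper confidence limits
ci <- est + c(-1, 1) * t_crit * se
```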

Question 2C

In model2, rank and years in current rank (year) are the only statistically significant predictors of salary. Relative to assistant professors, full professors earn about 11,119 dollars more and associate professors about 5,292 dollars more, and each additional year in current rank adds about 476 dollars, holding the other variables constant. Degree, sex, and years since degree (ysdeg) are not statistically significant.

Question 2D

Making full professor the baseline category flips the sign of the rank coefficients without changing the fit of the model. Relative to full professors, associate professors earn about 5,826 dollars less and assistant professors about 11,119 dollars less.
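
Changing the reference level only re-expresses the same fitted model. The sketch below shows this on simulated data (the sample sizes and salary values are assumptions, not the alr4 data), using `relevel()` as an alternative to rebuilding the factor with `factor(..., levels = ...)`.

```r
set.seed(1)

# Simulated three-level rank factor with Asst as the first (baseline) level
rank <- factor(rep(c("Asst", "Assoc", "Prof"), each = 10),
               levels = c("Asst", "Assoc", "Prof"))
pay <- 15000 + 5000 * (rank == "Assoc") + 11000 * (rank == "Prof") +
  rnorm(30, sd = 1000)

# Fit with Asst as the baseline
fit_asst <- lm(pay ~ rank)

# Refit with Prof as the baseline
rank_prof <- relevel(rank, ref = "Prof")
fit_prof <- lm(pay ~ rank_prof)

# Prof vs. Asst in the first fit is the negative of Asst vs. Prof in the second
prof_vs_asst <- unname(coef(fit_asst)["rankProf"])
asst_vs_prof <- unname(coef(fit_prof)["rank_profAsst"])

# Assoc vs. Prof equals the difference of the two rank coefficients in the first fit
assoc_vs_prof <- unname(coef(fit_prof)["rank_profAssoc"])
assoc_minus_prof <- unname(coef(fit_asst)["rankAssoc"] - coef(fit_asst)["rankProf"])
```

The same arithmetic holds in the output above: 5292.36 - 11118.76 = -5826.40, the rankAssoc coefficient in model3.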

Question 2E

Excluding rank makes degreePhD and ysdeg statistically significant and flips the signs of the degreePhD, sexFemale, and ysdeg coefficients, likely because rank was previously explaining their variance. Without rank in the model, these variables serve as proxies for the higher pay that comes with higher rank. However, there is still no statistically significant evidence of a salary difference based on sex (p = 0.33).

Question 2F

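Per model5, there is actually some evidence that the dean has been making less generous offers than previously: faculty with 15 or fewer years since their highest degree (dean = 1) earn an estimated 4,450 dollars less, holding sex, year, and degree constant (p = 0.0018). Multicollinearity can make it harder to draw inferences, because the dean indicator is derived from ysdeg and is likely closely related to rank. To avoid it, model5 excludes ysdeg and rank.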

Question 3

Code
# Load the dataset from smss explicitly; library(smss) alone did not make it available
data("house.selling.price", package = "smss")
summary(house.selling.price)
Code
data2 <- house.selling.price
model1 <- lm(data = data2, Price~Size+New)
summary(model1)
Code
predictNew <- (116.132*3000) + 57736.283*1 - 40230.867

predictNotNew <- (116.132*3000) + 57736.283*0 - 40230.867

model2 <- lm(data = data2, Price~Size+New+Size*New)
summary(model2)
Code
predictNew2 <- 104.438*3000 - 78527.502 + 61.916*3000 - 22227.808

predictNotNew2 <- 104.438*3000 - 22227.808

predictNew3 <- 104.438*1500 - 78527.502 + 61.916*1500 - 22227.808

predictNotNew3 <- 104.438*1500 - 22227.808

Question 3A

Both the size of the home and whether the home is new have statistically significant positive effects on selling price.

Question 3B

New Home Price = 116.132(size in square feet) + 57736.283(if New) - 40230.867

Not New Home Price = 116.132(size in square feet) - 40230.867

Question 3C

New Home Prediction = 365901.416

No New Home Prediction = 308165.133
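
The two equations from 3B can be wrapped in one function with New as a 0/1 indicator. This is a minimal sketch using the coefficients quoted above; the function name `predict_additive` is hypothetical.

```r
# Hypothetical helper: additive model with the coefficients from Question 3B
# size is in square feet; new is 1 for a new home and 0 otherwise
predict_additive <- function(size, new) {
  -40230.867 + 116.132 * size + 57736.283 * new
}

pred_new <- predict_additive(3000, new = 1)
pred_not_new <- predict_additive(3000, new = 0)

# In the additive model the premium for a new home is the same at every size
premium <- pred_new - pred_not_new
```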

Question 3D

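Size and New have a positive interaction effect: each additional square foot adds more to the predicted price of a new home (104.438 + 61.916 = 166.354 dollars) than to the predicted price of a home that is not new (104.438 dollars). Including the interaction also removes the statistical significance of the main effect of being new.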

Question 3E

New home Prediction = 104.438(size in square feet) - 78527.502(if new) + 61.916(new times size) - 22227.808

Not new home Prediction = 104.438(size in square feet) - 22227.808

Question 3F

New Home Prediction = 398306.69

No New Home Prediction = 291086.192

Question 3G

New Home Prediction = 148775.69

No New Home Prediction = 134429.192

As the home gets smaller, newness adds less to the predicted selling price, so the gap between new and not new homes shrinks: from about 107,220 dollars at 3000 square feet to about 14,346 dollars at 1500 square feet.
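
The way the gap depends on size can be made explicit. In the interaction model the premium for a new home is -78527.502 + 61.916 × size, which the sketch below computes from the coefficients quoted in 3E; the function names are hypothetical.

```r
# Hypothetical helper: interaction model with the coefficients from Question 3E
predict_interaction <- function(size, new) {
  -22227.808 + 104.438 * size - 78527.502 * new + 61.916 * size * new
}

# Predicted premium for a new home over a not new home of the same size
new_premium <- function(size) {
  predict_interaction(size, 1) - predict_interaction(size, 0)
}

premium_3000 <- new_premium(3000)
premium_1500 <- new_premium(1500)

# Size at which the predicted premium is zero (about 1268 square feet)
break_even <- 78527.502 / 61.916
```

Below the break-even size the model predicts a lower price for a new home than for a not new one, which is an extrapolation worth treating with caution.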

Question 3H

I prefer the interaction model. Rather than adding a flat premium of over 50k dollars for any new home, it lets the premium for newness vary with size, which gives a more graded prediction.
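
One way to back this preference with a formal check is a nested-model F-test via `anova()`. The sketch below runs it on simulated data; the sample size, coefficients, and noise level are assumptions chosen for illustration, not the house.selling.price data.

```r
set.seed(7)

# Simulated stand-in data (assumed values, not the real housing data)
n <- 100
size <- runif(n, 500, 4000)
new <- rbinom(n, 1, 0.2)
price <- -22000 + 104 * size - 78000 * new + 62 * size * new +
  rnorm(n, sd = 50000)
sim <- data.frame(price, size, new)

# Additive model and interaction model
fit_add <- lm(price ~ size + new, data = sim)
fit_int <- lm(price ~ size * new, data = sim)

# F-test for whether the interaction term improves the fit
cmp <- anova(fit_add, fit_int)
p_f <- cmp[["Pr(>F)"]][2]

# With a single added term, this equals the t-test p-value of that term
p_t <- summary(fit_int)$coefficients["size:new", "Pr(>|t|)"]
```

On the real data, the same comparison would be `anova()` applied to the two Question 3 fits.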