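Question 1
Setup: the analyses below use the alr4, smss, and tidyverse packages.
Code
library(alr4)
library(smss)
library(tidyverse)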
For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is
ŷ = -10,536 + 53.8x1 + 2.84x2.
A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.
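Plugging x1 = 1240 and x2 = 18,000 into the prediction equation gives the predicted price; the residual is the actual selling price minus that prediction.
Code
Actual_price <- 145000
Predicted <- -10536 + (53.8 * 1240) + (2.84 * 18000)  # 107296
residual <- Actual_price - Predicted                  # 37704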
Answer: Substituting the given x1 and x2 into the prediction equation gives a predicted selling price of $107,296. The residual is the actual price minus the predicted price: 145,000 - 107,296 = 37,704. A positive residual of $37,704 means the equation underpredicts this home's value, which could be due to limited data or to omitted variables that influence price.
For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?
Answer: If lot size is fixed, we can hold x2 constant and focus on x1. The coefficient on x1, home size in square feet, is 53.8, so the predicted selling price increases by $53.80 for each additional square foot of home size. This is exactly what a regression slope measures: the predicted change in y per one-unit increase in that predictor, holding the others fixed.
According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?
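The answer is the ratio of the two slopes:
Code
53.8 / 2.84  # 18.94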
Answer: For fixed home size, lot size would need to increase by 53.8 / 2.84 ≈ 18.94 square feet to have the same predicted impact as a one-square-foot increase in home size. Each square foot of home size is worth $53.80 and each square foot of lot size is worth $2.84, so dividing the two slopes gives the required amount.
Question 2
(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.
Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.
H0A: Mean salary for men and women is the same.
H1A: Mean salary for men and women is not the same.
Code
regression <- lm(salary ~ sex, data = salary)
summary(regression)
Call:
lm(formula = salary ~ sex, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8602.8 -4296.6 -100.8 3513.1 16687.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24697 938 26.330 <2e-16 ***
sexFemale -3340 1808 -1.847 0.0706 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared: 0.0639, Adjusted R-squared: 0.04518
F-statistic: 3.413 on 1 and 50 DF, p-value: 0.0706
Answer: Based on this regression alone we do not have enough evidence to reject the null hypothesis, since the p-value for the sex coefficient is 0.0706, which is not statistically significant at the 0.05 level. However, the coefficient for sexFemale is -3340, meaning women in this sample earn $3,340 less than men on average. This suggests women may be getting paid less, but more data, additional variables, and further testing would be needed to establish it.
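As an aside (not part of the original output), the same hypothesis test can be run as a pooled two-sample t-test, which is mathematically equivalent to this one-predictor regression and yields the same p-value:
Code
t.test(salary ~ sex, data = salary, var.equal = TRUE)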
Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.
Code
regression2 <- lm(salary ~ ., data = salary)
summary(regression2)
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
When all the predictors are included, the p-value for sexFemale grows to 0.214, moving us even further from rejecting the null, while the coefficient shifts substantially, from -3340 to +1166.37, a swing of over 4,500. This is presumably because the other variables (rank, year, degree, ysdeg) now absorb much of the raw salary difference between the sexes.
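The question also asks for a 95% confidence interval for the male-female salary difference, which the summary above does not print. A minimal way to obtain it is base R's confint(); given the estimate of 1166.37, standard error of 925.57, and 45 degrees of freedom, the interval works out to roughly (-698, 3031), which includes zero:
Code
confint(regression2, "sexFemale", level = 0.95)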
Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient/slope in relation to the outcome variable and other variables
degreePhD * p-value = 0.180 * interpretation = Holding the other predictors fixed, a PhD is associated with a salary increase of 1388.61. The direction makes sense, since higher education generally brings higher pay, but the coefficient is not statistically significant here, perhaps because rank and years of service carry most of the information.
rankAssoc * p-value = 3.22e-05 * interpretation = Highly significant, so we can reject the null and say this variable has an impact. Being an associate rather than an assistant professor (the baseline) is associated with a salary increase of 5292.36.
rankProf * p-value = 1.62e-10 * interpretation = Highly significant; we can reject the null hypothesis. This has the largest impact of all the variables: being a full professor rather than an assistant professor is associated with a salary increase of 11118.76, greater than the other coefficients combined.
sexFemale * p-value = 0.214 * interpretation = Not significant; we fail to reject the null hypothesis. The coefficient of 1166.37 would say that women earn that much more, holding the other variables fixed, but given the p-value we cannot treat this as an established effect.
year * p-value = 8.65e-06 * interpretation = Significant, allowing us to reject the null hypothesis. Each additional year in the current rank is associated with a salary increase of 476.31.
ysdeg * p-value = 0.115 * interpretation = Not significant; we fail to reject the null. The coefficient would imply a salary decrease of 124.57 per year since highest degree.
Change the baseline category for the rank variable. Interpret the coefficients related to rank again.
Code
salary$rank <- relevel(salary$rank, ref = 'Assoc')
regression3 <- lm(salary ~ ., data = salary)
summary(regression3)
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21038.41 1109.12 18.969 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAsst -5292.36 1145.40 -4.621 3.22e-05 ***
rankProf 5826.40 1012.93 5.752 7.28e-07 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
When we change the baseline category for rank to Assoc, the rank coefficients flip interpretation. rankAsst is now -5292.36 (p = 3.22e-05): being an assistant professor reduces predicted salary by 5292.36 relative to an associate professor. rankProf is now 5826.40 (p = 7.28e-07): full professors earn 5826.40 more than associate professors. Both are significant, so we can reject the null and say rank has an impact.
Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in the promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.
Exclude the variable rank, refit, and summarize how your findings changed, if they did.
Code
regression4 <- lm(salary ~ sex + year + degree + ysdeg, data = salary)
summary(regression4)
Call:
lm(formula = salary ~ sex + year + degree + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 2e-16 ***
sexFemale -1286.54 1313.09 -0.980 0.332209
year 351.97 142.48 2.470 0.017185 *
degreePhD -3299.35 1302.52 -2.533 0.014704 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
After removing rank from the equation, the p-values and coefficients of all the remaining variables change noticeably. sexFemale is negative again (-1286.54) and its p-value has increased to 0.332209, still not significant. year has decreased to 351.97 and is significant at the 95% level only. degreePhD, once positive, is now strongly negative at -3299.35 and significant at the 95% level. ysdeg is the only variable to grow: it is now 339.40 with a p-value of 0.000114. It is very interesting to see how much impact rank had on all the other variables in this equation.
Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variations in Salary.
Code
# we need to make a dummy variable for the dean
salary <- salary %>%
  mutate(new_dean = case_when(
    ysdeg <= 15 ~ 1,
    ysdeg > 15 ~ 0
  ))
# now that we have that, we can run the regression using an interaction
# term to look into the impact of the dean change
dean_regression <- lm(salary ~ year + degree + sex + rank + new_dean + new_dean * year,
                      data = salary)
summary(dean_regression)
Call:
lm(formula = salary ~ year + degree + sex + rank + new_dean +
new_dean * year, data = salary)
Residuals:
Min 1Q Median 3Q Max
-3309.8 -1102.5 -265.2 539.2 9339.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17427.11 1538.08 11.330 1.24e-14 ***
year 495.72 97.42 5.088 7.20e-06 ***
degreePhD 1161.63 859.23 1.352 0.1833
sexFemale 1115.96 862.06 1.295 0.2022
rankAsst -5416.07 1079.72 -5.016 9.14e-06 ***
rankProf 6196.73 1029.38 6.020 3.16e-07 ***
new_dean 3789.08 1867.72 2.029 0.0486 *
year:new_dean -195.43 183.99 -1.062 0.2940
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2358 on 44 degrees of freedom
Multiple R-squared: 0.8629, Adjusted R-squared: 0.8411
F-statistic: 39.58 on 7 and 44 DF, p-value: < 2.2e-16
To avoid problems with overlapping information, we created a dummy variable for the dean and an interaction term with year so that we can compare salaries before and after the new dean arrived. At the 95% confidence level, new_dean is significant (p = 0.0486): being hired by the new dean is associated with a salary increase of about 3789.08. The interaction term year:new_dean is -195.43, pulling that advantage down slightly for each additional year in rank, but it is not significant (p = 0.294).
Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?
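A quick way to see the multicollinearity concern: new_dean is a deterministic recode of ysdeg, so the two variables carry essentially the same information and would be severely collinear if both entered the model. The regression above avoids this by including new_dean and dropping ysdeg. A minimal check:
Code
# new_dean was built directly from ysdeg, so the two are strongly
# (negatively) correlated and should not appear in the same model
cor(salary$new_dean, salary$ysdeg)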
Question 3
(Data file: house.selling.price in smss R package)
Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of the home (in square feet) and whether the home is new(1 = yes; 0 = no). In particular, for each variable; discuss the statistical significance and interpret the meaning of the coefficient.
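The model regresses Price on Size and New from the house.selling.price data:
Code
data("house.selling.price")
house <- house.selling.price
Reg1 <- lm(Price ~ Size + New, data = house)
summary(Reg1)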
Call:
lm(formula = Price ~ Size + New, data = house)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
Size 116.132 8.795 13.204 < 2e-16 ***
New 57736.283 18653.041 3.095 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
There are two predictors to look at. Size has a coefficient of 116.132, meaning each additional square foot adds $116.13 to the predicted price, holding newness fixed. New has a coefficient of 57736.283, meaning a new home is predicted to sell for $57,736.28 more than an otherwise identical older home. Size is highly significant (p < 2e-16) and New is also significant (p = 0.00257), so both are useful for predicting house price. With these two variables alone we see an R-squared of 0.7226, which is quite good.
Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.
The prediction equation is: ŷ = -40230.867 + 116.132x1 + 57736.283x2, where x1 = size and x2 = 1 if the home is new and 0 if it is not.
For new homes (x2 = 1) this collapses to ŷ = 17505.42 + 116.132x1.
For homes that are not new (x2 = 0) it is ŷ = -40230.867 + 116.132x1.
Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.
new <- (116.132 * 3000) + 17505.42   # 365,901.42
old <- (116.132 * 3000) - 40230.867  # 308,165.13
The new home is predicted to sell for about $365,901 and the not-new home for about $308,165.
Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results
Code
Reg2 <- lm(Price ~ Size + New + New*Size, data = house)
summary(Reg2)
Call:
lm(formula = Price ~ Size + New + New * Size, data = house)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 2e-16 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
The regression results show that for new homes the price per square foot is 104.438 + 61.916 = 166.354 dollars, which is $61.92 higher than for homes that are not new. The interaction term Size:New is significant (p = 0.00527), and evaluating it is the focus of this regression.
Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.
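Collapsing the fitted model ŷ = -22227.81 + 104.44x1 - 78527.50x2 + 61.92(x1 × x2) into one line per group, and evaluating each at 3000 square feet for the comparison in the next part:
Code
old1 <- -22227.81 + (104.44 * 3000)   # not new: yhat = -22227.81 + 104.44*Size
new1 <- -100755.31 + (166.36 * 3000)  # new:     yhat = -100755.31 + 166.36*Size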
For homes that are not new, the line is ŷ = -22227.81 + 104.44x1; for new homes it is ŷ = -100755.31 + 166.36x1. Evaluated at 3000 square feet, the new home comes to $398,324.69 and the not-new home to $291,092.19, a huge difference.
Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of the home increases.
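The same two lines evaluated at 1500 square feet:
Code
old2 <- -22227.81 + (104.44 * 1500)   # 134,432.19
new2 <- -100755.31 + (166.36 * 1500)  # 148,784.69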
Here the predicted price of a new home at 1500 square feet is $148,784.69, while one that is not new is $134,432.19. The gap of about $14,352 is much smaller than the roughly $107,232 gap at 3000 square feet, and the reason is the slope: the new-home line is steeper by $61.92 per square foot, so the larger the home, the larger the predicted difference between new and not new.
Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?
Comparing the two regressions, I would choose the one with the interaction term, because it allows the per-square-foot price to differ for new homes rather than forcing a single static premium, which seems more realistic. The R-squared is also higher with the interaction term (0.7443 vs 0.7226, roughly 2 percentage points).