hw2
confidence intervals
DACSS603_Homework 4
Author

Meredith Derian-Toth

Published

April 30, 2023

Code
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Code
library("alr4")
library("smss")
library("ggplot2")

Question 1.

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.

a) A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

Code
#ŷ = −10,536 + 53.8x1 + 2.84x2
#x1 = size of home
#x2 = size of lot
#y = selling price

x1<-1240
x2<-18000

Predict_Sell_Price<-(-10536) + (53.8*x1) + (2.84*x2)
print(Predict_Sell_Price)
[1] 107296
Code
residual1a<-145000-Predict_Sell_Price
print(residual1a)
[1] 37704

The predicted selling price is $107,296, and the residual is 37704. The residual is positive, which means the price could go as high as 37,704 + 107,296. In other words, the predicted price is an underestimate.

b) For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

Looking at the equation: ŷ = −10,536 + 53.8x1 + 2.84x2 where x1 = size of home, x2 = size of lot, and y = selling price. When x2 remains constant, the predicted selling price will increase by x1, which is 53.8. This is because 53.8 is the coefficient of the variable that is not fixed.

c) According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

Code
print(53.8/2.84)
[1] 18.94366

Lot size would need to increase by almost 19 times itself in order to have the same impact as a one-square foot increase in home size.

Question 2.

(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include:

1) degree - a factor with levels PhD and MS.
2) rank - a factor with levels Asst, Assoc, and Prof.
3) sex - a factor with levels Male and Female.
4) Year - years in current rank.
5) ysdeg - years since highest degree.
6) salary - academic year salary in dollars.

Code
data(salary)
head(salary)
   degree rank    sex year ysdeg salary
1 Masters Prof   Male   25    35  36350
2 Masters Prof   Male   13    22  35350
3 Masters Prof   Male   10    23  28200
4 Masters Prof Female    7    27  26775
5     PhD Prof   Male   19    30  33696
6 Masters Prof   Male   16    21  28516

a) Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

Code
summary(lm(salary~sex, data = salary))

Call:
lm(formula = salary ~ sex, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8602.8 -4296.6  -100.8  3513.1 16687.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    24697        938  26.330   <2e-16 ***
sexFemale      -3340       1808  -1.847   0.0706 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared:  0.0639,    Adjusted R-squared:  0.04518 
F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706

When looking at just sex and salary, there is a difference in average salary between male and female faculty members. Female faculty members make on average $3,340 less than their male colleagues. This difference is not statistically significant as the p value is above .05 ( p value = 0.0706).

b) Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

Code
lm(salary ~ sex + degree + rank + year + ysdeg, data = salary) |>
          confint()
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
sexFemale    -697.8183  3030.56452
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
year          285.1433   667.47476
ysdeg        -280.6397    31.49105

The 95% confidence interval for the difference in salary between men and women is -697.81 to 3030.56. This means a woman’s salary could be as low as 697.81 less than a male’s or as high as 3030.56 more than a male’s.

c) Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

Code
summary(lm(salary ~ sex + degree + rank + year + ysdeg, data = salary))

Call:
lm(formula = salary ~ sex + degree + rank + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
sexFemale    1166.37     925.57   1.260    0.214    
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

According to the results of the regression:
(1) sexFemale has a positive coeficient of 1166.37 and a nonsigificant p value of 0.214. This means that females tend to have higher salarys with an average of $1,166 more than men. And that being female does not have a significant affect on salary.
(2) degreePhD has a positive coefficient of 1388.61 and nonsignificant p value of 0.180. This means, those with a PhD have a higher salary on average by $1,388 than those who have a MA. It also means that having a PhD does not affect salary significantly.
(3) rankAssoc has a positive coefficient of 5292.36 and a significant p value of 3.22e-05. This means, tenure faculty members who rank as an Assoc have higher salaries in general by $5,292, and the relationship between professor ranking and salary is significant.
(4) rankProf has a positive coefficient of 11118.76 and a significant p value of 1.62e-10. This means, those who are ranked as professors make significantly more money than those of other rankings (by about $11,118).
(5) year has a positive coefficient of 476.31 and a significant p value of 8.65e-06. This means that the longer you are in your ranking, the higher your salary will be (by about $476 per year), and that the relationship between years in ranking and salary is significant.
(6) ysdeg has a negative coefficient with a nonsignificant p value of 0.115. This means that the more years past your education, the less you are getting paid. Which could mean that the newer employees are getting offered more money. However, there is not a significant relationship between years since your education and salary.

e) Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’”

Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.

Exclude the variable rank, refit, and summarize how your findings changed, if they did.

Code
summary(lm(salary ~ sex + degree + year + ysdeg, data = salary))

Call:
lm(formula = salary ~ sex + degree + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
sexFemale   -1286.54    1313.09  -0.980 0.332209    
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

When removing rank, the model changes. The coefficient for sexFemale is now negative, which means women make less in salary than men by about $1,286.54. However, the p value is still not significant and the adjusted R- Square for the model has reduced from .83 to .6.

f) Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean.

Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.

Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

Code
salary<-salary%>%
  mutate(New_Dean = case_when(ysdeg <=15 ~ 1,
                   ysdeg > 15 ~ 0))
Error in salary %>% mutate(New_Dean = case_when(ysdeg <= 15 ~ 1, ysdeg > : could not find function "%>%"
Code
MultiCoTest<-cor(salary$ysdeg, salary$New_Dean, method = 'pearson')
Error in cor(salary$ysdeg, salary$New_Dean, method = "pearson"): supply both 'x' and 'y' or a matrix-like 'x'
Code
print(MultiCoTest)
Error in print(MultiCoTest): object 'MultiCoTest' not found
Code
summary(lm(salary ~ sex + degree + rank + year + New_Dean, data = salary))
Error in eval(predvars, data, env): object 'New_Dean' not found

Multi-colinearity could be a problem with this new variable because of the existing variable of “ysdeg”. This variable is the number of years since your highest degree obtained, therefore if the new variable is not different enough from this variable it will cause issues of multicolinearity.

Therefore, I created a dummy variable of “hired by the new dean” which equals to “1” or “not hired by the new dean” which equals to “0”, and then ran tests of multicolineary to test the relationship before running the regression again. The correlation between ysdeg and New_Dean is -0.843, which could mean multicolinearity. Therefore we ran the model without the ysdeg variable.

The results of the new model suggest that those hired by the new dean are making statistically significantly more money by about $2,163, with a p value of .0496. This is just barely statistically significant, however the results support the hypothesis that the new dean is offering higher salaries than the previous one.

Question 3

Code
data(house.selling.price)
head(house.selling.price)
  case Taxes Beds Baths New  Price Size
1    1  3104    4     2   0 279900 2048
2    2  1173    2     1   0 146500  912
3    3  3076    4     2   0 237700 1654
4    4  1608    3     2   0 200000 2068
5    5  1454    3     3   0 159900 1477
6    6  2997    3     2   1 499900 3153

a) Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). In particular, for each variable: discuss statistical significance and interpret the meaning of the coefficient.

Code
summary(lm(Price ~ Size + New, data = house.selling.price))

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

The results of the regression show that the size of the house and whether the house is new both have a statistically significant relationship with the selling price. For the size of the house, the price increases by 116 dollars with every square foot (p value = < 2e-16). And if the house is new it increases the price by an average of $57,736 (p value = 0.00257).

b) Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.

Prediction Equation: ŷ = -40230.87 + 116.13x1 + 57736.28x2
Where x1 = Size
And x2 = New (or not new)

For new homes:
ŷ = -40230.87 + 116.13x1 - 57736.28(1)

For not new homes:
ŷ = -40230.87 + 116.13x1 - 57736.28(0) therefore
ŷ = -40230.87 + 116.13x1

c) Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

Code
x1<-3000
x2<-1
x3<-0

ŷ <- (-40230.87) + (116.13*x1) + (57736.28*x2)
ŷ
[1] 365895.4
Code
ŷ2 <- (-40230.87) + (116.13*x1) + (57736.28*x3)
ŷ2
[1] 308159.1

(i) The predicted selling price of a house that is new and 3000 square feet is $365,895.40.
(ii) The predicted selling price of a house that is not new and 3000 square feet is $308,159.10.

d) Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results.

Code
summary(lm(Price ~ Size + New + Size * New, data = house.selling.price))

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

The results of the regression show that the size of the house as a significant relationship to the selling price with a p value of < 2e-16. The price of the house increases by 104 dollars for every square foot. Whether the house is new is now no longer a significant relationship to the price of the house with a p value of 0.127 and a negative coefficient. This means newer homes decrease the price. However the relationship between size and newness has a significant relationship to selling price with a p value of 0.00527.

e) Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

ŷ = -22227.81 + 104.44x1 - 78527.50x2 + 61.92x3

(i) New Homes:
ŷ = -22227.81 + 104.44x1 - 78527.50(1) + 61.92x3

(ii) Not New Homes:
ŷ = -22227.81 + 104.44x1 - 78527.50(0) + 61.92x3 Therefore
ŷ = -22227.81 + 104.44x1 + 61.92x3

f) Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

Code
#New Home
x1<-3000
x2<-1
x3<-x2*x1

predict_selling_price_fi<- (-22227.81) + (104.44*x1) - (78527.50*x2) + (61.92*x3)
print(predict_selling_price_fi)
[1] 398324.7
Code
#Not New Home
x1<-3000
x4<-0
x3<-x4*x1
predict_selling_price_fii<- (-22227.81) + (104.44*x1) - (78527.50*x4) + (61.92*x3)
print(predict_selling_price_fii)
[1] 291092.2

(i) New home of 3000 square feet has a predicted selling price of $398,324.70
(ii) Not new home of 3000 square feet has a predicted selling price of $291,092.20

g) Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

Code
#New Home
x1<-1500
x2<-1
x3<-x2*x1

predict_selling_price_fi<- (-22227.81) + (104.44*x1) - (78527.50*x2) + (61.92*x3)
print(predict_selling_price_fi)
[1] 148784.7
Code
#Not New Home
x1<-1500
x4<-0
x3<-x4*x1
predict_selling_price_fii<- (-22227.81) + (104.44*x1) - (78527.50*x4) + (61.92*x3)
print(predict_selling_price_fii)
[1] 134432.2
Code
398324.70/148784.70
[1] 2.677189

(i) New home of 1500 square feet has a predicted selling price of $148784.70
(ii) Not new home of 1500 square feet has a predicted selling price of $134432.20
The price of a home went from 148784.70 dollars for a 1500 square foot house to $398,324.70 for a 3000 square foot house. This is a price increase of 2.7x with 2x increase in square feet. This means that the price increases a higher rate than 1-1 for square feet to dollar. According to the coefficient of the model, 104.438, for every 1 square foot increase the price increases by 104.44 dollars.

h) Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

The Non-interaction Model: (Intercept) -40230.867 Adjusted R-squared: 0.7169 p-value: < 2.2e-16

(i) The predicted selling price of a house that is new and 3000 square feet is $365,895.40.
(ii) The predicted selling price of a house that is not new and 3000 square feet is $308,159.10.

The interaction Model: (Intercept) -22227.808 Adjusted R-squared: 0.7363 p-value: < 2.2e-16

(i) New home of 3000 square feet has a predicted selling price of $398,324.70
(ii) Not new home of 3000 square feet has a predicted selling price of $291,092.20

I believe the second model, with the interaction between “newness” and size is a more predictive model for the selling price. First looking at the adjusted R-squared, it is larger in interaction model (0.7363) compared to the non-interaction model (0.7169). Second, comparing the two models, we see that “New” is significant in the first model (p value of 0.00257), however it is no longer significant when we run the model with the interaction (p value of 0.12697). This means, the interaction of “newness” and size is more predictive than “newness” alone. Lastly, when comparing the predicted selling price for a 3000 square foot house, the interaction model allows for more variation. This would reflect more real world scenarios than the first, non-interactive, model.