HW4

library(smss)
Warning: package 'smss' was built under R version 4.2.2
library(alr4)
Loading required package: car
Loading required package: carData
Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
library(stats)
library(tidyr)
library(ggplot2)

Question 1

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.

a) A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

predicted_sp <- -10536+53.8*1240+2.84*18000
predicted_sp
[1] 107296

Given the home's size and lot size, the predicted selling price is $107,296.

Now to calculate the residual between the above figure and the actual selling price of $145,000.

residual <- 145000-predicted_sp
residual
[1] 37704

The residual is $37,704, which indicates that the home and property sold for $37,704 above the predicted selling price.

b) For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

The coefficient for home size in square feet (x1) is 53.8, so with lot size held fixed, the predicted selling price increases by $53.80 for each additional square foot of home size.

c) According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

The coefficient for x2 (lot size in square feet) is 2.84, meaning that each additional square foot of lot adds $2.84 to the predicted price, holding home size fixed. To find the lot-size increase with the same impact as a one-square-foot increase in home size, divide the home-size coefficient by the lot-size coefficient:

53.8/2.84
[1] 18.94366

Lot size would have to increase by 18.94 square feet to have the same price impact as a one square foot increase in house size.
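As a quick arithmetic check, adding one square foot of home size or 53.8/2.84 ≈ 18.94 square feet of lot size should change the predicted price by the same amount. A small sketch (the function name predict_price is my own):

```r
# Prediction equation from Question 1
predict_price <- function(home_sqft, lot_sqft) {
  -10536 + 53.8 * home_sqft + 2.84 * lot_sqft
}

base <- predict_price(1240, 18000)
# +1 sq ft of home vs. +53.8/2.84 sq ft of lot: both add the same amount
delta_home <- predict_price(1241, 18000) - base
delta_lot  <- predict_price(1240, 18000 + 53.8 / 2.84) - base
c(delta_home, delta_lot)  # both equal 53.8
```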

Question 2

a) Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

data(salary)
t.test(salary~sex,data = salary)

    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14 

The sample mean salary is $24,696.79 for males and $21,357.14 for females, a difference of $3,339.65 in favor of males. However, the p-value (0.09) exceeds 0.05 and the 95% confidence interval (−$567.85 to $7,247.15) includes zero, so at the 5% level we cannot conclude that the population mean salaries differ by sex.

b) Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

lr_predictors <- lm(salary~degree+rank+sex+year+ysdeg, data=salary)
lr_predictors

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Coefficients:
(Intercept)    degreePhD    rankAssoc     rankProf    sexFemale         year  
    15746.0       1388.6       5292.4      11118.8       1166.4        476.3  
      ysdeg  
     -124.6  
set.seed(3)
mf_pay_rate <- data.frame(degree=sample(salary$degree,size=10,replace=T),
                          rank=sample(salary$rank,size=10,replace=T),
                          sex=sample(salary$sex,size=10,replace=T),
                          year=sample(salary$year,size=10,replace=T),
                          ysdeg=sample(salary$ysdeg,size=10,replace=T))
predict(lr_predictors,mf_pay_rate)
       1        2        3        4        5        6        7        8 
15368.63 20671.56 14287.78 29407.55 20069.45 32082.22 23606.80 16837.87 
       9       10 
21524.03 28077.52 
summary(lr_predictors)

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
confint(lr_predictors,'sexFemale')
              2.5 %   97.5 %
sexFemale -697.8183 3030.565

After fitting the multiple regression, the 95% confidence interval for the sexFemale coefficient is (−$697.82, $3,030.57): adjusting for degree, rank, year, and ysdeg, females are estimated to earn anywhere from $697.82 less to $3,030.57 more than comparable males. Because the interval contains zero, this difference is not statistically significant.

c) Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables.

summary(lr_predictors)

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Holding the other predictors fixed, the model estimates that salary is higher by:
1) $1,388.61 for individuals with a PhD (vs. a Master's degree)
2) $5,292.36 for associate professors (vs. assistant professors)
3) $11,118.76 for full professors (vs. assistant professors)
4) $1,166.37 for females (vs. males)
5) $476.31 for each additional year in current rank

Salary is estimated to be $124.57 lower for each additional year since the highest degree (ysdeg), other predictors held fixed. This is plausible despite the other positive effects: given rank and years in rank, a larger ysdeg means more time elapsed without further promotion, and it may also reflect higher starting salaries for more recent hires. As for significance, only rank (rankAssoc, rankProf) and year are statistically significant at the 5% level; degree (p = 0.180), sex (p = 0.214), and ysdeg (p = 0.115) are not.

e) Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.

Exclude the variable rank, refit, and summarize how your findings changed, if they did.

lr_predictors3 <- lm(salary~degree+sex+year+ysdeg,data=salary)

set.seed(3)
income_df3 <- data.frame(degree=sample(salary$degree,size=10,replace=T),
                  sex=sample(salary$sex,size=10,replace=T),
                  year=sample(salary$year,size=10,replace=T),
                  ysdeg=sample(salary$ysdeg,size=10,replace=T))
predict(lr_predictors3,income_df3)
       1        2        3        4        5        6        7        8 
15631.50 25840.16 23116.76 29904.74 30290.05 27114.13 19427.73 25588.74 
       9       10 
23098.28 21451.56 
summary(lr_predictors3)

Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

With 'rank' excluded, the sex coefficient changes sign: females are now estimated to earn $1,286.54 less than males, holding the other predictors fixed, though this difference is still not statistically significant (p = 0.33). The estimated salary is also:
1) $3,299.35 lower for individuals with a PhD
2) $351.97 higher for each additional year in current rank
3) $339.40 higher for each year since the highest degree (ysdeg is now positive and highly significant, since it absorbs some of the seniority effect previously carried by rank)
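One way to quantify how much 'rank' was contributing is a nested-model F-test comparing the two fits; a sketch, assuming the salary data from alr4 is loaded:

```r
# Nested-model F-test: do the two rank dummy coefficients jointly
# improve the fit over the model without rank?
full    <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)
no_rank <- lm(salary ~ degree + sex + year + ysdeg, data = salary)
anova(no_rank, full)
```

A significant F statistic here means rank explains salary variation beyond the other predictors, consistent with the large drop in R-squared (0.855 to 0.631) seen above.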

f) Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.

Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

lr_predictor4 <- lm(salary~rank+degree+sex+year+ysdeg+year*ysdeg,data=salary)
summary(lr_predictor4)

Call:
lm(formula = salary ~ rank + degree + sex + year + ysdeg + year * 
    ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4055.4 -1007.7  -172.6   800.0  9275.7 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16289.506    944.422  17.248  < 2e-16 ***
rankAssoc    5627.056   1184.705   4.750 2.19e-05 ***
rankProf    11475.286   1389.241   8.260 1.71e-10 ***
degreePhD    1557.213   1028.855   1.514   0.1373    
sexFemale    1233.531    925.994   1.332   0.1897    
year          318.343    174.450   1.825   0.0748 .  
ysdeg        -172.406     89.161  -1.934   0.0596 .  
year:ysdeg      7.094      6.578   1.078   0.2867    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2394 on 44 degrees of freedom
Multiple R-squared:  0.8588,    Adjusted R-squared:  0.8363 
F-statistic: 38.22 on 7 and 44 DF,  p-value: < 2.2e-16

The slope for ‘year:ysdeg’ is positive, which points in the direction of the hypothesis that people hired by the new dean make more than those hired beforehand. However, the interaction term is not statistically significant (p = 0.287), so this model offers only weak support.
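A more direct route is the new variable the question asks for: an indicator for being hired by the new Dean (ysdeg ≤ 15). Dropping ysdeg itself then avoids multicollinearity, since the indicator is computed from it and the two would be strongly correlated. A sketch (the variable name newdean is my choice):

```r
# Indicator: 1 if hired by the new Dean (highest degree <= 15 years ago)
salary$newdean <- ifelse(salary$ysdeg <= 15, 1, 0)

# ysdeg is excluded because newdean is derived from it and the two would
# be strongly collinear; a positive, significant newdean coefficient
# would support the "more generous offers" hypothesis.
dean_fit <- lm(salary ~ degree + rank + sex + year + newdean, data = salary)
summary(dean_fit)
```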

Question 3

a) Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.

Let’s start with presenting summary data to see what we’re working with:

library(smss)
data("house.selling.price")
head(house.selling.price)
  case Taxes Beds Baths New  Price Size
1    1  3104    4     2   0 279900 2048
2    2  1173    2     1   0 146500  912
3    3  3076    4     2   0 237700 1654
4    4  1608    3     2   0 200000 2068
5    5  1454    3     3   0 159900 1477
6    6  2997    3     2   1 499900 3153
summary(house.selling.price)
      case            Taxes           Beds       Baths           New      
 Min.   :  1.00   Min.   :  20   Min.   :2   Min.   :1.00   Min.   :0.00  
 1st Qu.: 25.75   1st Qu.:1178   1st Qu.:3   1st Qu.:2.00   1st Qu.:0.00  
 Median : 50.50   Median :1614   Median :3   Median :2.00   Median :0.00  
 Mean   : 50.50   Mean   :1908   Mean   :3   Mean   :1.96   Mean   :0.11  
 3rd Qu.: 75.25   3rd Qu.:2238   3rd Qu.:3   3rd Qu.:2.00   3rd Qu.:0.00  
 Max.   :100.00   Max.   :6627   Max.   :5   Max.   :4.00   Max.   :1.00  
     Price             Size     
 Min.   : 21000   Min.   : 580  
 1st Qu.: 93225   1st Qu.:1215  
 Median :132600   Median :1474  
 Mean   :155331   Mean   :1629  
 3rd Qu.:169625   3rd Qu.:1865  
 Max.   :587000   Max.   :4050  
str(house.selling.price)
'data.frame':   100 obs. of  7 variables:
 $ case : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Taxes: int  3104 1173 3076 1608 1454 2997 4054 3002 6627 320 ...
 $ Beds : int  4 2 4 3 3 3 3 3 5 3 ...
 $ Baths: int  2 1 2 2 3 2 2 2 4 2 ...
 $ New  : int  0 0 0 0 0 1 0 1 0 0 ...
 $ Price: int  279900 146500 237700 200000 159900 499900 265500 289900 587000 70000 ...
 $ Size : int  2048 912 1654 2068 1477 3153 1355 2075 3990 1160 ...

Now I'm going to run the regression in which y = selling price (USD) is modeled as a function of house size and whether the home is new (1 = yes, 0 = no).

summary(lm(Price~Size+New,data=house.selling.price))

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

b) Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.

In this model, ŷ = −40,230.87 + 116.13x1 + 57,736.28x2, where x1 is house size in square feet and x2 indicates whether the home is new. Setting x2 to 0 and 1 gives separate equations for the two groups:

For not-new homes: ŷ = −40,230.87 + 116.13(Size)
For new homes: ŷ = (−40,230.87 + 57,736.28) + 116.13(Size) = 17,505.41 + 116.13(Size)

The two lines share the same slope, so a new home is predicted to sell for $57,736.28 more than a not-new home of the same size.

And now a correlation test to go a step further.

cor.test(house.selling.price$Size,house.selling.price$New)

    Pearson's product-moment correlation

data:  house.selling.price$Size and house.selling.price$New
t = 4.1212, df = 98, p-value = 7.891e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2032530 0.5399831
sample estimates:
      cor 
0.3843277 

Predictor variables ‘New’ and ‘Size’ have p-values of 0.00257 and < 2e-16, respectively. Both are statistically significant, so the null hypothesis that each coefficient is zero can be rejected. The correlation test shows that the relationship between the two predictors themselves is fairly weak (r = 0.384), so multicollinearity is not a serious concern here.
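Given the reported correlation, we can gauge multicollinearity directly: with two predictors, the variance inflation factor is 1/(1 − r²). A quick check:

```r
# VIF implied by the Size/New correlation reported above
r <- 0.3843277
vif <- 1 / (1 - r^2)
vif  # about 1.17, far below the usual concern thresholds of 5-10
```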

c) Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

First let’s calculate the predicted selling price of a NEW 3000 sqft house

predicted_spNEW <- -40230.87+116.13*3000+57736
predicted_spNEW
[1] 365895.1

The predicted selling price for a new house of these measurements is $365,895.10

predicted_spOLD <- -40230.87+116.13*3000
predicted_spOLD
[1] 308159.1

The predicted selling price for a NOT new 3000 sqft house is $308,159.10
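Rather than plugging rounded coefficients in by hand, predict() uses the unrounded fit; the small differences from the figures above are rounding error. A sketch (the object name fit is my own):

```r
# Refit the additive model and predict for a 3000 sq ft home, new and not new
fit <- lm(Price ~ Size + New, data = house.selling.price)
predict(fit, newdata = data.frame(Size = 3000, New = c(1, 0)))
```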

d) Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results

Let’s start by calculating this interaction

interaction_size <- summary(lm(Price~Size+New+Size*New, data=house.selling.price))
interaction_size

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

With the interaction term between ‘Size’ and ‘New’, the estimated price increase per square foot is $104.44 for not-new homes and $104.44 + $61.92 = $166.35 for new homes. The interaction coefficient has a standard error of $21.69 and t = 2.855 (p = 0.00527), so the difference in slopes is statistically significant.

e) Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

ggplot(data=house.selling.price,aes(x=Size,y=Price, color=New))+
  geom_point()+
  geom_smooth(method="lm",se=F)
`geom_smooth()` using formula 'y ~ x'

From the interaction model, the prediction lines are: (i) for new homes, ŷ = (−22,227.81 − 78,527.50) + (104.44 + 61.92)(Size) = −100,755.31 + 166.35(Size); (ii) for not-new homes, ŷ = −22,227.81 + 104.44(Size). The scatterplot above shows the relationship between price and house size (sqft), shaded by whether the home is new. Newer houses (light blue) sit at or above the fitted line, while many of the older homes (dark) fall well below it; the older homes also greatly outnumber the new ones and cluster in the bottom left of the graph.

f) Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

new_house3000 <- -22227.81+104.44*3000-78527.50+61.9*3000*1
new_house3000
[1] 398264.7

The predicted selling price for a new, 3000 sqft home is $398,264.70

older_house3000 <- -22227.81+104.4*3000
older_house3000
[1] 290972.2

The predicted selling price for an older home of the same size is $290,972.20

g) Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

new_house1500 <- -22227.81+104.4*1500-78527.50+61.9*1500
new_house1500
[1] 148694.7

The predicted selling price for a new, 1500sqft house is $148,694.70

older_house1500 <- -22227.81+104.4*1500
older_house1500
[1] 134372.2

The predicted selling price for an older, 1500sqft home is $134,372.20

Now to compare to part f. For new homes, doubling size from 1,500 to 3,000 square feet raises the predicted price from $148,694.70 to $398,264.70; for older homes, from $134,372.20 to $290,972.20. The new-home premium therefore grows with size: it is roughly $14,300 at 1,500 square feet but roughly $107,300 at 3,000 square feet, because the interaction term adds about $61.92 per square foot to the slope for new homes.
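The widening gap follows directly from the interaction model: the new-home premium at a given size is the New coefficient plus the Size:New coefficient times Size. A sketch using the unrounded coefficients (the function name premium is my own):

```r
# New-home premium as a function of size, from the interaction model:
# premium(Size) = beta_New + beta_Size:New * Size
premium <- function(size) -78527.502 + 61.916 * size

premium(1500)  # roughly $14,300
premium(3000)  # roughly $107,200
```

Every extra square foot widens the premium by $61.92, which is why the gap at 3,000 square feet is so much larger than at 1,500.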

h) Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

The model with the interaction better represents the relationship. The Size:New interaction term is statistically significant (p = 0.00527), and the adjusted R-squared improves from 0.7169 to 0.7363, which indicates that the price per square foot genuinely differs between new and not-new homes. The model without the interaction is simpler, but it forces both groups onto a single slope that the data do not support.
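The two models can also be compared formally, for example with a nested F-test or AIC (lower is better); a sketch, assuming both fits on house.selling.price:

```r
fit_additive    <- lm(Price ~ Size + New, data = house.selling.price)
fit_interaction <- lm(Price ~ Size * New, data = house.selling.price)

# Nested F-test for the interaction term, plus an AIC comparison
anova(fit_additive, fit_interaction)
AIC(fit_additive, fit_interaction)
```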