Homework 4 - Prahitha Movva

hw4
multiple regression
interaction term
covariance
The fourth homework
Author

Prahitha Movva

Published

November 17, 2022

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
Warning: package 'smss' was built under R version 4.2.2
Code
knitr::opts_chunk$set(echo=TRUE, warning=FALSE)

Question 1

A

Given the prediction equation y = -10536 + 53.8x1 + 2.84x2, where x1 is the size of the home (sq ft) and x2 is the lot size (sq ft).

Code
lot.size = 18000
home.size = 1240
actual.price = 145000

predicted.price = -10536 + (53.8*home.size) + (2.84*lot.size)
print(predicted.price)
[1] 107296
Code
residual = actual.price - predicted.price
print(residual)
[1] 37704

Using the prediction equation, the predicted selling price is $107,296 and the residual is $37,704. The positive residual tells us that the model under-predicted the selling price: the house actually sold for $37,704 more than predicted.

B

For a fixed lot size (say k), our prediction equation becomes y = -10536 + 53.8x1 + 2.84k, where x1 (home size) is the only variable. So for each square-foot increase in home size, the house selling price predicted by the model increases by $53.80.

Code
# Effect of a one-square-foot increase in home size, at a fixed lot size (here 1)
y1 = -10536 + (53.8*1) + 2.84
y2 = -10536 + (53.8*2) + 2.84

y2 - y1
[1] 53.8

C

For a fixed home size (say k), our prediction equation becomes y = -10536 + 53.8k + 2.84x2, where x2 (lot size) is the only variable. So for each square-foot increase in lot size, the house selling price predicted by the model increases by $2.84. For a change in lot size to have the same effect as a one-square-foot increase in home size ($53.80), lot size must increase by 53.8/2.84 ≈ 18.94 square feet.

Code
53.8/2.84
[1] 18.94366
Code
# Baseline: the $53.80 effect of one extra square foot of home size
y1 = -10536 + (53.8*1) + 2.84
y2 = -10536 + (53.8*2) + 2.84
print(y2 - y1)
[1] 53.8
Code
# The same effect, achieved by increasing lot size by 18.94366 sq ft instead
y1 = -10536 + (53.8*1) + 2.84
y2 = -10536 + (53.8*1) + (2.84*(1 + 18.94366))
print(y2 - y1)
[1] 53.79999

Question 2

A

H0: Mean salary for men and women is the same

Ha: Mean salary for men and women is NOT the same

We can test this using a two-sample t-test, first checking whether the equal-variance assumption holds.

Code
data(salary)

# F-test for equality of variances between the two groups
var.test(salary ~ sex, data=salary)

    F test to compare two variances

data:  salary by sex
F = 0.84242, num df = 37, denom df = 13, p-value = 0.6525
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.3015275 1.9189474
sample estimates:
ratio of variances 
         0.8424225 

The p-value of the F-test is 0.6525, which is greater than the significance level (alpha = 0.05). So there is no significant difference between the variances of the two groups, and the equal-variance t-test is appropriate.

Code
t.test(salary ~ sex, data=salary, var.equal=TRUE)

    Two Sample t-test

data:  salary by sex
t = 1.8474, df = 50, p-value = 0.0706
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -291.257 6970.550
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14 

The p-value of the t-test is 0.0706, which is greater than the significance level (alpha = 0.05). So we fail to reject the null hypothesis: there is no significant difference in mean salary between male and female faculty at the 5% significance level.

B

Code
model = lm(salary ~ degree + rank + sex + year + ysdeg, data=salary)
confint(model)
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
sexFemale    -697.8183  3030.56452
year          285.1433   667.47476
ysdeg        -280.6397    31.49105

From the sexFemale row, the 95% confidence interval for the difference in salary between females and males, holding the other predictors constant, is [-697.8183, 3030.56452].

C

Code
summary(model)

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

(a)

At the 5% significance level, only rank and year are statistically significant predictors of salary.

(b)

Since rank and year are the only significant predictors, we focus on those for interpretation. Associate Professors (rankAssoc) and Full Professors (rankProf) earn $5,292.36 and $11,118.76 more, respectively, than Assistant Professors (the baseline category). Similarly, each additional year in the current rank (year) adds $476.31 to the predicted salary. Rank has the larger effect on salary, as its coefficients are much bigger than the year coefficient.
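To make that comparison concrete, a rough calculation from the fitted coefficients above: the number of additional years in rank it would take to match the full-professor premium.

Code
11118.76 / 476.31
[1] 23.34354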

D

Code
salary$rank <- relevel(salary$rank, ref='Assoc')
model.modified = lm(salary ~ degree + rank + sex + year + ysdeg, data=salary)
summary(model.modified)

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21038.41    1109.12  18.969  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAsst    -5292.36    1145.40  -4.621 3.22e-05 ***
rankProf     5826.40    1012.93   5.752 7.28e-07 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
Code
5826.40 + 5292.36
[1] 11118.76

We see consistent results. Since the baseline category is now Associate Professor, the coefficient for Assistant Professor is negative, which says that Associate Professors make $5,292.36 more than Assistant Professors. Similarly, Full Professors make $5,826.40 more than Associate Professors; the two premiums sum to the $11,118.76 full-professor premium from the original coding, as checked above.

E

Code
summary(lm(salary ~ degree + sex + year + ysdeg, data=salary))

Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

With rank excluded, degreePhD and ysdeg are now significant predictors of salary in addition to year. The p-value for sexFemale, however, is still much greater than 0.05, so sex remains non-significant at the 5% level. We also observe that the coefficients for degreePhD, sexFemale, and ysdeg change sign. Taken at face value, this model says that professors with a masters earn approximately $3,300 more than professors with a PhD, and that male professors earn around $1,287 more than female professors. The R-squared also drops from 0.855 to 0.631, so this model does a much poorer job of explaining the variation than the model that included rank.
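The sign reversals are consistent with rank acting as a confounder: ysdeg and degree track rank, so once rank is dropped, those variables absorb part of the rank effect. One quick way to see this (a minimal sketch; output omitted):

Code
# Mean years since degree within each rank; higher ranks should show
# larger means, tying ysdeg to rank
aggregate(ysdeg ~ rank, data = salary, FUN = mean)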

F

Code
salary <- mutate(salary, dean = case_when(ysdeg < 15 ~ "New",
                                         ysdeg >=15 ~ "Old"))

Since the dean variable is constructed directly from ysdeg, the two are necessarily collinear, so I expect that removing ysdeg will remove the multicollinearity. I check with variance inflation factors first.

Code
vif(lm(salary ~ dean + degree + sex + rank + year + ysdeg, data=salary))
           GVIF Df GVIF^(1/(2*Df))
dean   3.986217  1        1.996551
degree 2.144390  1        1.464373
sex    1.542571  1        1.242003
rank   4.347951  2        1.444013
year   2.445467  1        1.563799
ysdeg  6.821294  1        2.611761

As a rule of thumb, a VIF above 5 signals problematic collinearity, and ysdeg is the only predictor above that threshold, so I remove ysdeg from the predictors.

Code
summary(lm(salary ~ dean + degree + sex + rank + year, data=salary))

Call:
lm(formula = salary ~ dean + degree + sex + rank + year, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3588.0 -1532.2  -232.2   565.7  9132.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  20468.7      951.7  21.507  < 2e-16 ***
deanOld      -2421.6     1187.9  -2.038   0.0474 *  
degreePhD     1073.5      843.3   1.273   0.2096    
sexFemale     1046.7      858.0   1.220   0.2289    
rankAsst     -5012.5     1002.3  -5.001 9.16e-06 ***
rankProf      6213.3     1045.0   5.946 3.76e-07 ***
year           450.7       81.5   5.530 1.55e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2360 on 45 degrees of freedom
Multiple R-squared:  0.8597,    Adjusted R-squared:  0.841 
F-statistic: 45.95 on 6 and 45 DF,  p-value: < 2.2e-16

The coefficient for deanOld is negative with a p-value below 0.05, which supports the hypothesis: holding the other predictors constant, faculty hired by the new dean are predicted to earn $2,421.60 more than faculty hired under the previous dean.
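For contrast, the unadjusted group means need not show the same gap, since new-dean hires tend to have fewer years in rank; a quick check (a sketch; output omitted):

Code
# Raw mean salary by dean group, ignoring rank, year, and the other predictors
aggregate(salary ~ dean, data = salary, FUN = mean)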

Question 3

A

Code
data("house.selling.price")
summary(lm(Price ~ Size + New, data = house.selling.price))

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

Both Size and New are significant at the 5% level. The Size coefficient says that, for homes of the same condition (new or old), a one-square-foot increase in size raises the predicted price by $116.13. The New coefficient says that a new home is predicted to sell for $57,736.28 more than an old home of the same size.

B

Using the coefficients from above, the prediction equation becomes:

price = -40230.867 + (116.132 * home.size) + (57736.283 * home.new)

Separate equations for new and not new homes:

price.old = -40230.867 + (116.132 * home.size)

price.new = -40230.867 + (116.132 * home.size) + 57736.283
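Equivalently, predict() on the fitted model reproduces the hand calculations in part C below, up to coefficient rounding (a minimal sketch; fit.main is a name introduced here):

Code
# Refit the additive model and predict for a 3000 sq ft home,
# new (New = 1) and not new (New = 0)
fit.main <- lm(Price ~ Size + New, data = house.selling.price)
predict(fit.main, newdata = data.frame(Size = 3000, New = c(1, 0)))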

C

(i)

Code
price.new = -40230.867 + (116.132 * 3000) + 57736.283
price.new
[1] 365901.4

(ii)

Code
price.old = -40230.867 + (116.132 * 3000)
price.old
[1] 308165.1

D

Code
summary(lm(Price ~ Size * New, data = house.selling.price))

Call:
lm(formula = Price ~ Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16
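The significant Size:New term means the size slope differs by condition: for not new homes the slope is the Size coefficient alone, while for new homes the interaction coefficient is added on top. A quick calculation from the coefficients above:

Code
# Implied price-per-square-foot slopes from the interaction model
slope.old = 104.438
slope.new = 104.438 + 61.916
print(slope.new)
[1] 166.354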

E

Code
data.old <- subset(house.selling.price, New == 0)
data.new <- subset(house.selling.price, New == 1)


ggplot() +
  geom_smooth(data=data.old, aes(x = Size, y = Price), 
              method = "lm", se = FALSE, color = "blue") +
  geom_point(data=data.old, aes(x = Size, y = Price), color = "blue") +
  geom_smooth(data=data.new, aes(x = Size, y = Price), 
              method = "lm", se = FALSE, color = "red") +
  geom_point(data=data.new, aes(x = Size, y = Price), color = "red")
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

The red line and points represent the new homes, while the blue line and points represent the not new homes. The visibly steeper red line matches the significant positive interaction.

F

From the coefficients that we got in D, the prediction equations become:

price = -22227.808 + (104.438 * home.size) + (-78527.502 * home.new) + (61.916 * home.size * home.new)

price.old = -22227.808 + (104.438 * home.size)

price.new = -22227.808 + (104.438 * home.size) - 78527.502 + (61.916 * home.size)
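As before, predict() on the fitted interaction model should reproduce the hand calculations below (a minimal sketch; fit.inter is a name introduced here, with rows covering parts F and G at 3000 and 1500 square feet):

Code
# Refit the interaction model and predict for new/old homes at two sizes
fit.inter <- lm(Price ~ Size * New, data = house.selling.price)
predict(fit.inter, newdata = data.frame(Size = c(3000, 3000, 1500, 1500),
                                        New  = c(1, 0, 1, 0)))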

(i)

Code
price.new = -22227.808 + (104.438 * 3000) - 78527.502 + (61.916 * 3000)
price.new
[1] 398306.7

(ii)

Code
price.old = -22227.808 + (104.438 * 3000)
price.old
[1] 291086.2

G

(i)

Code
price.new = -22227.808 + (104.438 * 1500) - 78527.502 + (61.916 * 1500)
price.new
[1] 148775.7

(ii)

Code
price.old = -22227.808 + (104.438 * 1500)
price.old
[1] 134429.2

The difference in price between new and not new homes of the same size is much smaller for smaller homes. This is exactly what the interaction term implies: the premium for a new home grows with size, so the condition of the home matters more for larger homes.
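To quantify this, the interaction model's new-home premium can be written as a function of size, premium(size) = -78527.502 + 61.916 * size, which grows by about $61.92 per square foot (a sketch using the coefficients from part D; the values match the differences of the predictions above):

Code
# New-home premium implied by the interaction model
premium = function(size) -78527.502 + (61.916 * size)
premium(1500)
[1] 14346.5
Code
premium(3000)
[1] 107220.5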

H

I would prefer the second model, with the interaction term: both the multiple R-squared and the adjusted R-squared are higher than for the model without the interaction, and the interaction term itself is statistically significant (p = 0.00527).
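A more formal comparison is the nested-model F-test; with a single added term it is equivalent to the t-test on Size:New above, but it makes the model comparison explicit (a sketch reusing the fit.main and fit.inter names introduced earlier; output omitted):

Code
# F-test: does adding Size:New significantly improve the fit?
anova(fit.main, fit.inter)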