hw4
challenge4

Author: Jerin Jacob

Published: May 11, 2023

Code
library(readxl)
library(dplyr)
library(magrittr)
library(alr4)
library(smss)
library(ggplot2)
library(stargazer)
knitr::opts_chunk$set(echo = TRUE)

Question 1: A) Finding the predicted house price. We write a function to compute y_hat from the predictors x1 (home size) and x2 (lot size).

Code
y_hat <- function(x1,x2) {
  -10536 + 53.8*x1 + 2.84*x2
}

predicted_price <- y_hat(1240, 18000)

cat('The predicted selling price is $', predicted_price, sep = '')
The predicted selling price is $107296
Code
Actual_price <- 145000
cat('The residual is', Actual_price - predicted_price)
The residual is 37704

The residual is positive, which means the model underpredicted the selling price by a considerably large amount ($37,704).

  1. For a fixed lot size, the predicted selling price increases by $53.80 for every one-unit (square-foot) increase in home size, because that is the coefficient on home size when lot size is also in the model.

  2. For a one-square-foot increase in home size, the predicted price increases by $53.80, keeping lot size constant. For a one-unit increase in lot size, the predicted price increases by $2.84. So to produce the same change in predicted price as a one-square-foot increase in home size, lot size must increase by 53.8/2.84, roughly 18.94 units, as checked right below.
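As a quick arithmetic check of that ratio (a minimal sketch, using the coefficient values from the fitted equation above):

Code
# Lot-size units needed to match the price effect of one extra
# square foot of home size
53.8 / 2.84  # ≈ 18.94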

Question 2:

Loading the data

Code
data("salary", package = "alr4")
head(salary)
   degree rank    sex year ysdeg salary
1 Masters Prof   Male   25    35  36350
2 Masters Prof   Male   13    22  35350
3 Masters Prof   Male   10    23  28200
4 Masters Prof Female    7    27  26775
5     PhD Prof   Male   19    30  33696
6 Masters Prof   Male   16    21  28516

To test the hypothesis that the mean salary for men and women is the same, without regard to any other variable, we can run a simple linear regression model with sex as the only explanatory variable. This is equivalent to a pooled two-sample t-test of salary between the Male and Female groups.

Code
summary(lm(salary ~ sex, data = salary))

Call:
lm(formula = salary ~ sex, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8602.8 -4296.6  -100.8  3513.1 16687.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    24697        938  26.330   <2e-16 ***
sexFemale      -3340       1808  -1.847   0.0706 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared:  0.0639,    Adjusted R-squared:  0.04518 
F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706

The p-value (0.0706) is less than .1, so the test is significant at the 10% significance level and we can reject the null hypothesis that the means are the same. The coefficient of sexFemale is -3340, which means that in this bivariate model (no other variables are controlled for), female faculty earn on average $3340 less than male faculty.
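As a cross-check, the equivalent two-sample t-test can be run directly. This is a minimal sketch; var.equal = TRUE makes the t-test use the same pooled-variance assumption as the regression, so it reproduces the t value and p-value above.

Code
# Pooled two-sample t-test of salary by sex
t.test(salary ~ sex, data = salary, var.equal = TRUE)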

Next, we run a multiple linear regression with salary as the outcome variable and all other variables as predictors, and compute 95% confidence intervals for the coefficients.

Code
lm(salary ~ ., data = salary) |>
  confint()
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
sexFemale    -697.8183  3030.56452
year          285.1433   667.47476
ysdeg        -280.6397    31.49105

The 95% confidence interval for the sexFemale coefficient is (-698, 3031). This suggests that female faculty earn somewhere between $698 less and $3031 more than male faculty, controlling for the other variables. Because the interval contains zero, the difference is not statistically significant at the 5% level.
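If we only want the interval for sexFemale rather than the whole table, confint accepts a parm argument; a small sketch:

Code
# 95% CI for just the sexFemale coefficient
confint(lm(salary ~ ., data = salary), parm = "sexFemale", level = 0.95)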

Code
summary(lm(salary ~ ., data = salary))

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Degree: This variable is not statistically significant, as the p-value of .180 is above any conventionally accepted significance level. It is a dummy variable; the coefficient suggests that PhDs make $1388.61 more than those with a Master's, controlling for everything else.

Rank: The model treats rank as a categorical variable, ignoring its order. Ordered categorical variables like rank are usually treated either as regular categorical variables or as numeric variables. Numeric treatment is common when the variable has many levels or the distances between levels are reasonably equal. But in this case, since there are only 3 levels, treating it as a regular categorical variable is acceptable.

For rank, the base/reference level is Assistant Professor. rankAssoc suggests that Associate Professors make $5292 more than the base level, and rankProf suggests that Professors make $11118.76 more than the base level. Both dummy variables are statistically significant.

To test the statistical significance of the rank variable as a whole, rather than the individual dummy variables, we can do a partial F-test comparing the model with all the variables to the one without the rank dummies. The easiest way to do this is:

Code
fit1 <- lm(salary ~ ., data = salary)
fit2 <- lm(salary ~ . - rank, data = salary)  # model without the rank dummies

anova(fit2, fit1)
The reduced model drops the two rank dummies, so the comparison costs 2 degrees of freedom. The partial F-test's very low p-value indicates that the rank variable is significant as a whole.
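Alternatively, a Type II ANOVA tests each term as a whole in a single call. A sketch using car::Anova, assuming the car package is available (alr4 loads it as a dependency):

Code
# Type II tests: rank gets one 2-df F-test instead of two dummy t-tests
car::Anova(fit1)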

Sex: The p-value suggests that this variable is not significant at conventional levels. The coefficient suggests that female faculty make $1166 more after controlling for all other variables, but interpreting a coefficient when the effect is insignificant is not meaningful.

Year: This variable is statistically significant, and the coefficient suggests that each additional year is associated with a $476 increase in salary.

ysdeg: This variable is insignificant. The coefficient suggests that each additional year since the highest degree is associated with $124 less salary.

Code
# Make Prof the reference level so the other ranks are compared to it
salary$rank <- relevel(salary$rank, ref = 'Prof')
summary(lm(salary ~ ., data = salary))

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
degreePhD     1388.61    1018.75   1.363    0.180    
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
sexFemale     1166.37     925.57   1.260    0.214    
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Now the reference level for rank is Professor, and the coefficients for the other levels of rank change accordingly. Assistant Professors make $11118.76 less than Professors, controlling for other variables. Associate Professors make $5826 less than Professors, controlling for all other variables.
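As a sanity check on the releveling, the new rankAssoc coefficient should equal the old Assoc-vs-Asst gap minus the old Prof-vs-Asst gap (values copied from the two summaries above):

Code
# Assoc vs Prof = (Assoc vs Asst) - (Prof vs Asst)
5292.36 - 11118.76  # -5826.4, matching rankAssoc in the releveled model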

Code
summary(lm(salary ~ . - rank, data = salary))

Call:
lm(formula = salary ~ . - rank, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

After removing the rank variable, the regression coefficients changed substantially. Now the coefficient of degreePhD is negative (-3299), which suggests that faculty with a PhD are paid $3299 less than non-PhD faculty, controlling for the other variables. The signs for sexFemale and ysdeg also flipped; ysdeg is now strongly significant, while sexFemale remains insignificant.
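One way to see why the signs flip is that ysdeg is strongly related to rank, so once rank is dropped, ysdeg absorbs much of its effect. A quick sketch to inspect this:

Code
# Mean years-since-degree within each rank; large differences here
# explain why ysdeg picks up rank's effect when rank is dropped
aggregate(ysdeg ~ rank, data = salary, FUN = mean)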

  1. We need to create a dummy variable, newDean_hire, for faculty hired within 15 years of their highest degree (ysdeg <= 15). Since we are creating the new variable from an existing one, there is a high chance of multicollinearity, so testing the correlation between the new variable and the original is a good idea!
Code
salary$newDean_hire <- ifelse(salary$ysdeg <= 15, 1, 0)
cor.test(salary$newDean_hire, salary$ysdeg)

    Pearson's product-moment correlation

data:  salary$newDean_hire and salary$ysdeg
t = -11.101, df = 50, p-value = 4.263e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9074548 -0.7411040
sample estimates:
       cor 
-0.8434239 

The two variables are highly negatively correlated (r = -0.84). So when we run the regression, we exclude ysdeg and include newDean_hire instead, along with the other control variables.

Code
summary(lm(salary ~ .-ysdeg, data = salary))

Call:
lm(formula = salary ~ . - ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3403.3 -1387.0  -167.0   528.2  9233.8 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   24425.32    1107.52  22.054  < 2e-16 ***
degreePhD       818.93     797.48   1.027   0.3100    
rankAsst     -11096.95    1191.00  -9.317 4.54e-12 ***
rankAssoc     -6124.28    1028.58  -5.954 3.65e-07 ***
sexFemale       907.14     840.54   1.079   0.2862    
year            434.85      78.89   5.512 1.65e-06 ***
newDean_hire   2163.46    1072.04   2.018   0.0496 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared:  0.8594,    Adjusted R-squared:  0.8407 
F-statistic: 45.86 on 6 and 45 DF,  p-value: < 2.2e-16

The newDean_hire coefficient is 2163, and it is statistically significant at the 5% level (p = 0.0496). It suggests that the new dean gave generous offers: faculty with 15 or fewer years since their highest degree earn $2163 more, controlling for the other variables.
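For context, the sizes of the two hiring groups can be checked quickly (a small sketch):

Code
# Faculty hired within 15 years of their highest degree (1) vs not (0)
table(salary$newDean_hire)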

Just checking how the regression performs if we include both variables:

Code
summary(lm(salary ~ ., data = salary))

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3621.2 -1336.8  -271.6   530.1  9247.6 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   25179.14    1901.59  13.241  < 2e-16 ***
degreePhD      1135.00    1031.16   1.101    0.277    
rankAsst     -11411.45    1362.02  -8.378 1.16e-10 ***
rankAssoc     -6177.44    1043.04  -5.923 4.39e-07 ***
sexFemale      1084.09     921.49   1.176    0.246    
year            460.35      95.09   4.841 1.63e-05 ***
ysdeg           -47.86      97.71  -0.490    0.627    
newDean_hire   1749.09    1372.83   1.274    0.209    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2382 on 44 degrees of freedom
Multiple R-squared:  0.8602,    Adjusted R-squared:  0.838 
F-statistic: 38.68 on 7 and 44 DF,  p-value: < 2.2e-16

Now neither variable is significant, because of the multicollinearity between them.
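Variance inflation factors make this multicollinearity visible. A sketch using car::vif (car is loaded with alr4); with factor predictors it reports generalized VIFs:

Code
# Large (G)VIF values for ysdeg and newDean_hire confirm that the
# two variables carry overlapping information
car::vif(lm(salary ~ ., data = salary))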

Question 3: Loading the data

Code
data("house.selling.price", package = "smss")
head(house.selling.price)
  case Taxes Beds Baths New  Price Size
1    1  3104    4     2   0 279900 2048
2    2  1173    2     1   0 146500  912
3    3  3076    4     2   0 237700 1654
4    4  1608    3     2   0 200000 2068
5    5  1454    3     3   0 159900 1477
6    6  2997    3     2   1 499900 3153
Code
y_hat <- lm(Price ~ Size + New, data = house.selling.price)

summary(y_hat)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

Both Size and New are statistically significant. For each additional square foot of size, the house price increases by $116, controlling for the other variable. New houses are priced $57736 higher than old houses, controlling for the other variable.

  1. The prediction equation is: E[Price] = -40230.867 + 116.132 * Size + 57736.283 * New

The New variable is binary: it takes the value 1 when the house is new and 0 when it is old.

The prediction equation for new houses is: E[Price] = -40230.867 + 116.132 * Size + 57736.283 * 1, i.e., E[Price] = 17505.42 + 116.132 * Size

The prediction equation for old houses is: E[Price] = -40230.867 + 116.132 * Size + 57736.283 * 0 = -40230.867 + 116.132 * Size

Code
# One way of finding the predicted selling price
size <- 3000
new <- 1
not_new <- 0

house_price_new <- -40230.867 + 116.132 * size + 57736.283 * new
house_price_not_new <- -40230.867 + 116.132 * size + 57736.283 * not_new

# Print the results
cat("Price for new houses with 3000 sqft size is: [", house_price_new, "]\n")
Price for new houses with 3000 sqft size is: [ 365901.4 ]
Code
cat("Price for houses with 3000 sqft size that are not new is: [", house_price_not_new, "]\n")
Price for houses with 3000 sqft size that are not new is: [ 308165.1 ]
Code
# Easy way

predict_data <- data.frame(Size = c(3000, 3000), New = c(1, 0))
predict(y_hat, predict_data)
       1        2 
365900.2 308163.9 
Code
y_hat_interaction <- lm(Price ~ Size + New + Size*New , data = house.selling.price)

summary(y_hat_interaction)

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

After adding the interaction between Size and New, the coefficient of New reverses sign and is no longer statistically significant. The interaction term is statistically significant at the 1% level.

The prediction equation is: E[Price] = -22227.808 + 104.438 * Size - 78527.502 * New + 61.916 * Size * New

For new houses: E[Price] = -22227.808 + 104.438 * Size - 78527.502 * 1 + 61.916 * Size * 1 = -100755.31 + 166.354 * Size

For old houses: E[Price] = -22227.808 + 104.438 * Size - 78527.502 * 0 + 61.916 * Size * 0 = -22227.808 + 104.438 * Size

Code
predict_data_interaction <- data.frame(Size = c(3000, 3000), New = c(1, 0))
prediction_with_interaction <- predict(y_hat_interaction, predict_data_interaction)
cat("Price for houses with 3000 sqft size that are new and not new are: [", prediction_with_interaction, "] consecutively \n")
Price for houses with 3000 sqft size that are new and not new are: [ 398307.5 291087.4 ] consecutively 

A new home is $107,220 pricier than an old home of 3000 sqft.

Code
predict_data_interaction2 <- data.frame(Size = c(1500, 1500), New = c(1, 0))
prediction_with_interaction2 <- predict(y_hat_interaction, predict_data_interaction2)
cat("Price for houses with 1500 sqft size that are new and not new are: [", prediction_with_interaction2, "] consecutively \n")
Price for houses with 1500 sqft size that are new and not new are: [ 148776.1 134429.8 ] consecutively 

A new home with 1500 sqft has a predicted price of $148776; an old home of the same size, $134430. The new home is about $14,346 pricier than an old home of 1500 sqft.

Comparing the price differences between new and old houses in F and G, we can conclude that the bigger the house, the bigger the price difference between new and old ones; this is exactly what the positive Size:New interaction implies, since the gap grows by $61.92 per square foot.
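The new-minus-old gap can be written directly from the interaction model's coefficients (a small sketch; values copied from the summary above):

Code
# Predicted new-minus-old price gap as a function of size:
# gap(Size) = -78527.502 + 61.916 * Size
gap <- function(size) -78527.502 + 61.916 * size
gap(c(1500, 3000))  # ≈ 14346.5 and 107220.5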

Code
summary(y_hat)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16
Code
summary(y_hat_interaction)

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

The preferred model is the one with the interaction term, because the interaction term is significant and the Adjusted R-squared is higher in that model. The higher Adjusted R-squared means that, despite the penalty for the additional term, the interaction model fits better.
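The same conclusion follows from a partial F-test comparing the two nested models (a sketch; with a single added term, this F-test is equivalent to the interaction term's t-test above):

Code
# Partial F-test: does adding Size:New significantly improve fit?
anova(y_hat, y_hat_interaction)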