Homework 4, Intro to Quant
Author

Asch Harwood

Published

April 25, 2023

Code
library("dplyr")
library("knitr")
library(kableExtra)
library(xtable)
library(ggplot2)
library(GGally)
library(lme4)
library(car)
library(dplyr)
library(alr4)
library(stargazer)
library(smss)

Question 1

A

Code
x1 <-  1240 #house size
x2 <- 18000 #lot size
y_observed <- 145000
Code
# find predicted selling price
y_hat <- (-10536) + (53.8*x1) + (2.84*x2)
residual <- y_observed - y_hat
cat('Predicted sale price: $', y_hat)
Predicted sale price: $ 107296
Code
cat('\n')
Code
cat('Residual: $', residual)
Residual: $ 37704
Code
cat('\n')
Code
cat('The predicted sale price of $', y_hat, ' is $', residual, ' less than the observed selling price of $', y_observed, '.', sep = '')
The predicted sale price of $107296 is $37704 less than the observed selling price of $145000.

B

Holding lot size fixed, each additional square foot of house size increases the predicted selling price by $53.80, which is the coefficient on house size.
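
As a minimal numeric sketch (not part of the original solution; predict_price is a hypothetical helper), shifting house size by one square foot while holding lot size fixed changes the prediction by exactly this coefficient:

Code
# hypothetical helper evaluating the fitted equation
predict_price <- function(x1, x2) -10536 + 53.8*x1 + 2.84*x2
predict_price(1241, 18000) - predict_price(1240, 18000)
[1] 53.8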

C

The lot size would need to increase by about 18.94 square feet (53.8/2.84) to raise the predicted home value by the same $53.80 that one additional square foot of house size provides.
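
A minimal check of this ratio in R:

Code
# lot-size increase needed to match one additional square foot of house size
53.8 / 2.84
[1] 18.94366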

Question 2

Code
data(salary)

A

We cannot reject the null hypothesis that there is no difference between the mean male and female salaries at the 0.05 significance level, given the observed p-value of 0.09. The 95 percent confidence interval supports this conclusion because it contains zero.

Code
# distribution of salaries by sex
ggplot(data = salary, aes(x = salary, fill = sex)) + 
  geom_histogram(bins = 5)

Code
male_salaries <- salary[salary$sex == "Male",]$salary
female_salaries <- salary[salary$sex == "Female",]$salary
t.test(male_salaries, female_salaries)

    Welch Two Sample t-test

data:  male_salaries and female_salaries
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
mean of x mean of y 
 24696.79  21357.14 

B

Code
data(salary)
fit <- lm(salary ~ rank + sex + year + ysdeg + degree, data = salary)
sex_ci <- confint(fit)["sexFemale", ]
cat('sexFemale confidence interval: ', sex_ci)
sexFemale confidence interval:  -697.8183 3030.565

C

Intercept: The intercept of $15,746.05 is the predicted salary when all predictors are zero, which can be interpreted as the ‘base’ salary. In this case, the reference professor is a male assistant professor with a master’s degree, zero years in the current rank, and zero years since highest degree.

rankAssoc: Holding all else equal, associate professors make $5,292.36 more than assistant professors. This finding is statistically significant, which means we can reject the null hypothesis that there is no difference in salary between associate and assistant professors.

rankProf: Holding all else equal, full professors make $11,118.76 more than assistant professors. This finding is statistically significant, which means we can reject the null hypothesis that there is no difference in salary between full and assistant professors.

sexFemale: Holding all else equal, female professors make $1,166.37 more than male professors. This finding is NOT statistically significant, which means we cannot reject the null hypothesis that there is no difference in salary between male and female professors.

year: Holding all else equal, each additional year in the current rank is associated with a $476.31 increase in salary. This finding is statistically significant, which means we can reject the null hypothesis that years in rank have no effect on salary.

ysdeg: Holding all else equal, each additional year since highest degree is associated with a $124.57 decrease in salary. This finding is NOT statistically significant, which means we cannot reject the null hypothesis that years since degree have no effect on salary.

degreePhD: Holding all else equal, professors with PhDs make $1,388.61 more than those with master’s degrees. This finding is NOT statistically significant.

Code
summary(fit)

Call:
lm(formula = salary ~ rank + sex + year + ysdeg + degree, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
degreePhD    1388.61    1018.75   1.363    0.180    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

D

rankAsst: Holding all else equal, assistant professors make $11,118.76 less than full professors. This finding is statistically significant.

rankAssoc: Holding all else equal, associate professors make $5,826.40 less than full professors. This finding is statistically significant.

Code
salary$rank <- relevel(salary$rank, ref="Prof")
fit <- lm(salary ~ rank + sex + year + ysdeg + degree, data = salary)
summary(fit)

Call:
lm(formula = salary ~ rank + sex + year + ysdeg + degree, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
sexFemale     1166.37     925.57   1.260    0.214    
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
degreePhD     1388.61    1018.75   1.363    0.180    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

E

By excluding rank:

  • the model as a whole is less useful: R-squared drops from 0.86 to 0.63, and adjusted R-squared, which rewards simpler models, also drops from 0.84 to 0.60. The model without rank therefore has less explanatory power. We also see the residual standard error rise, from about $2,398 to about $3,744.

  • sexFemale is now associated with a $1,286.54 decrease in salary, compared with the original model’s $1,166.37 increase. Sex remains statistically insignificant.

  • The coefficient for year decreases slightly, from $476.31 to $351.97, though it remains statistically significant.

  • Each additional year since degree is now associated with a $339.40 increase in salary, compared with the $124.57 decrease in the previous model. This coefficient is now statistically significant.

  • Having a PhD is now associated with a $3,299.35 decrease in salary, compared with the $1,388.61 increase in the original model; this coefficient is statistically significant in the new model.

A nested-model F-test, sketched after the comparison table below, formalizes the drop in fit.

Code
fit_no_rank <- lm(salary ~ sex + year + ysdeg + degree, data = salary)
Code
# Create a side-by-side table of the models using stargazer
stargazer(fit, fit_no_rank, type = "text",
          title = "Professor Salary Regression",
          align = TRUE,
          column.labels = c("M1:AllVari", "M2:NoRank"),
          ci = TRUE, # Show confidence intervals
          digits = 2) 

Professor Salary Regression
==================================================================
                                 Dependent variable:              
                    ----------------------------------------------
                                        salary                    
                          M1:AllVari              M2:NoRank       
                              (1)                    (2)          
------------------------------------------------------------------
rankAsst                 -11,118.76***                            
                    (-13,768.19, -8,469.34)                       
                                                                  
rankAssoc                -5,826.40***                             
                    (-7,811.72, -3,841.09)                        
                                                                  
sexFemale                  1,166.37               -1,286.54       
                      (-647.71, 2,980.45)   (-3,860.15, 1,287.06) 
                                                                  
year                       476.31***               351.97**       
                       (290.28, 662.34)        (72.71, 631.23)    
                                                                  
ysdeg                       -124.57               339.40***       
                       (-276.44, 27.30)        (181.38, 497.41)   
                                                                  
degreePhD                  1,388.61              -3,299.35**      
                      (-608.09, 3,385.32)    (-5,852.24, -746.46) 
                                                                  
Constant                 26,864.81***            17,183.57***     
                    (24,169.30, 29,560.33)  (14,933.65, 19,433.50)
                                                                  
------------------------------------------------------------------
Observations                  52                      52          
R2                           0.86                    0.63         
Adjusted R2                  0.84                    0.60         
Residual Std. Error   2,398.42 (df = 45)      3,743.50 (df = 47)  
F Statistic          44.24*** (df = 6; 45)  20.11*** (df = 4; 47) 
==================================================================
Note:                                  *p<0.1; **p<0.05; ***p<0.01
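
As mentioned above, a nested-model F-test can formalize this comparison; a minimal sketch reusing the fitted objects from this section (output omitted):

Code
# F-test: does rank add explanatory power over the reduced model?
anova(fit_no_rank, fit)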

F

To prevent multicollinearity, I excluded ysdeg from the model. A visual inspection of a pair plot suggests that ysdeg is correlated with year, and even without a visual inspection we know it is correlated with dean_selected, because dean_selected is derived from ysdeg. Multicollinearity is worth removing because it can distort the other coefficients in the model, possibly obscuring the true relationships between the independent variables and the dependent variable.

There is evidence to suggest that the dean is preferentially rewarding staff he has hired. Holding all else equal, those hired by the dean earn $2,163.46 more than those who were not. This coefficient is statistically significant at the 0.05 level (p = 0.0496).

Code
# add dean-selection indicator: hired within the last 15 years
salary$dean_selected <- ifelse(salary$ysdeg <= 15, 1, 0)

# drop ysdeg because it is correlated with dean_selected
x_no_y <- subset(salary, select = -c(salary, ysdeg))
# pairs(x_no_y)  # pair plot used for the visual inspection

# fit without ysdeg because of its correlation with dean_selected and year
fit_dean <- lm(salary ~ degree + rank + sex + dean_selected + year, data = salary)
summary(fit_dean)

Call:
lm(formula = salary ~ degree + rank + sex + dean_selected + year, 
    data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3403.3 -1387.0  -167.0   528.2  9233.8 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    24425.32    1107.52  22.054  < 2e-16 ***
degreePhD        818.93     797.48   1.027   0.3100    
rankAsst      -11096.95    1191.00  -9.317 4.54e-12 ***
rankAssoc      -6124.28    1028.58  -5.954 3.65e-07 ***
sexFemale        907.14     840.54   1.079   0.2862    
dean_selected   2163.46    1072.04   2.018   0.0496 *  
year             434.85      78.89   5.512 1.65e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared:  0.8594,    Adjusted R-squared:  0.8407 
F-statistic: 45.86 on 6 and 45 DF,  p-value: < 2.2e-16
Code
vif(fit_dean)
                  GVIF Df GVIF^(1/(2*Df))
degree        1.341872  1        1.158392
rank          2.964200  2        1.312130
sex           1.295820  1        1.138341
dean_selected 2.678486  1        1.636608
year          1.726209  1        1.313853

Question 3

Code
data("house.selling.price")
house.selling.price$New <- as.factor(house.selling.price$New)

A

Size: Holding all else equal, a one square foot increase in house size is associated with a $116.13 increase in the selling price. This finding is statistically significant.

New: Holding all else equal, a new house sells for $57,736.28 more than an ‘old’ house. This finding is also statistically significant.

Code
fit <- lm(Price ~ Size + New, data = house.selling.price)
summary(fit)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New1         57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

B

Combined Model

price = -40230.867 + 116.132*x_size + 57736.283*x_new

In this model, the predicted price is the intercept of -40,230.867, plus 116.132 multiplied by the house square footage, plus 57,736.283 if the house is new.

New House Model

price_new_house = 116.132*x_size + 17505.42

Old House Model

price_old_house = 116.132*x_size - 40230.867

Code
-40230.867 + 57736.283
[1] 17505.42

C

Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

Code
# house size in square feet
x_size <- 3000

# new house model
price_new_house <- 116.132*x_size + 17505.42

# old house model
price_old_house <- 116.132*x_size - 40230.867

cat('New house estimate: ', price_new_house)
New house estimate:  365901.4
Code
cat('\n')
Code
cat('Old house estimate: ', price_old_house)
Old house estimate:  308165.1
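
The same estimates can also be obtained directly from the fitted additive model with predict(); a minimal sketch (output omitted, but it should reproduce the hand calculations above):

Code
# sketch: model-based predictions for a 3000 sq ft house, new vs. not new
newdata <- data.frame(Size = 3000, New = factor(c(1, 0), levels = c(0, 1)))
predict(fit, newdata = newdata)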

D

Code
fit <- lm(Price ~ Size + New + Size*New, data = house.selling.price)
summary(fit)

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New1        -78527.502  51007.642  -1.540  0.12697    
Size:New1       61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

E

New1: The coefficient is -$78,527.50. It implies that, compared to old houses, new houses with zero square feet would sell for $78,527.50 less, an extrapolation with no practical meaning. This result, however, is NOT statistically significant.

Size:New1: For every one square foot increase in size, a new home’s sale price increases by $61.92 more than an old home’s. This coefficient is statistically significant.
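
Combining the rounded coefficients above gives the implied price-per-square-foot slope for each group; a minimal sketch:

Code
# implied slopes from the interaction model (rounded estimates from above)
slope_old <- 104.438           # Size coefficient alone
slope_new <- 104.438 + 61.916  # Size plus the Size:New1 interaction
c(old = slope_old, new = slope_new)
    old     new 
104.438 166.354 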

F

Code
new_house <- data.frame(Size = 3000, New = 1)
new_house$New <- as.factor(new_house$New)
new_house_predicted_price <- as.numeric(predict(fit, newdata = new_house))


old_house <- data.frame(Size = 3000, New = 0)
old_house$New <- as.factor(old_house$New)
old_house_predicted_price <- as.numeric(predict(fit, newdata = old_house))

cat('New house predicted price: ', new_house_predicted_price)
New house predicted price:  398307.5
Code
cat('\n')
Code
cat('Old house predicted price: ', old_house_predicted_price)
Old house predicted price:  291087.4

G

The interaction term says that each additional square foot adds $61.92 more to the price of a new home than to an old one. As a result, as houses get larger, the price difference between new and old homes also gets larger: at 1,500 square feet the predicted gap is about $14,346, while at 3,000 square feet it grows to about $107,220.

Code
new_house <- data.frame(Size = 1500, New = 1)
new_house$New <- as.factor(new_house$New)
new_house_predicted_price <- as.numeric(predict(fit, newdata = new_house))


old_house <- data.frame(Size = 1500, New = 0)
old_house$New <- as.factor(old_house$New)
old_house_predicted_price <- as.numeric(predict(fit, newdata = old_house))

cat('New house predicted price: ', new_house_predicted_price)
New house predicted price:  148776.1
Code
cat('\n')
Code
cat('Old house predicted price: ', old_house_predicted_price)
Old house predicted price:  134429.8

H

I prefer the model with the interaction term. While it is only slightly stronger in terms of R-squared, the model and the interaction term are still simple enough to interpret easily. I also prefer it because ‘New’ by itself is not statistically significant once the interaction is included. The New variable is not specific enough about what it measures to be all that useful on its own for understanding home prices: what constitutes a new house, and what happens if an old house has been remodeled or has an addition? A nested-model F-test, sketched after the comparison table below, supports keeping the interaction.

Code
fit_interaction <- lm(Price ~ Size + New + Size*New, data = house.selling.price)
fit_no_interaction <- lm(Price ~ Size + New, data = house.selling.price)

stargazer(fit_interaction, fit_no_interaction, type = "text",
          title = "New vs Old Home Prices",
          align = TRUE,
          column.labels = c("M1:wInteraction", "M2:NoInteraction"),
          ci = TRUE, # Show confidence intervals
          digits = 2) 

New vs Old Home Prices
=====================================================================
                                   Dependent variable:               
                    -------------------------------------------------
                                          Price                      
                        M1:wInteraction          M2:NoInteraction    
                              (1)                      (2)           
---------------------------------------------------------------------
Size                       104.44***                116.13***        
                        (85.97, 122.91)          (98.89, 133.37)     
                                                                     
New1                       -78,527.50              57,736.28***      
                    (-178,500.60, 21,445.64)  (21,176.99, 94,295.57) 
                                                                     
Size:New1                   61.92***                                 
                        (19.41, 104.42)                              
                                                                     
Constant                   -22,227.81             -40,230.87***      
                     (-52,648.62, 8,193.01)  (-69,034.77, -11,426.96)
                                                                     
---------------------------------------------------------------------
Observations                  100                      100           
R2                            0.74                     0.72          
Adjusted R2                   0.74                     0.72          
Residual Std. Error   51,998.11 (df = 96)      53,880.95 (df = 97)   
F Statistic          93.15*** (df = 3; 96)    126.34*** (df = 2; 97) 
=====================================================================
Note:                                     *p<0.1; **p<0.05; ***p<0.01
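
For completeness, a nested-model F-test comparing the two fits would reach the same conclusion as the interaction term’s t-test; a minimal sketch (output omitted):

Code
# F-test: does the interaction term improve on the additive model?
anova(fit_no_interaction, fit_interaction)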