hw4
Author

Young Soo Choi

Published

April 25, 2023

Question 1

(A)

Code
x1=1240
x2=18000
pred_price <- -10536 + 53.8*x1 + 2.84*x2
pred_price
[1] 107296

When the given values for size of home and size of lot are 1240 and 18000, the predicted selling price is 107296.

Code
145000-pred_price
[1] 37704

The residual is 37704: the house sold for 37704 more than the model predicts. In other words, 37704 of the actual selling price is not explained by the size of the home and the size of the lot in the given model.
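The prediction and residual above can also be wrapped in a small helper. This is a minimal sketch: `predict_price` is a hypothetical name that does not appear in the assignment, and the coefficients are simply copied from the given equation.

```r
# Hypothetical helper: predicted selling price from the given equation
predict_price <- function(home_size, lot_size) {
  -10536 + 53.8 * home_size + 2.84 * lot_size
}

observed <- 145000
predicted <- predict_price(1240, 18000)
resid_value <- observed - predicted   # positive: the model underpredicts

predicted
resid_value
```

A positive residual means the observed price lies above the fitted value.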

(B)

The predicted selling price increases by 53.8. Since the coefficient of x1 (size of home) is 53.8, a one-unit increase in x1 with x2 (size of lot) held constant raises y (predicted selling price) by 53.8.

(C)

As discussed above, when x2 is constant and x1 increases by 1 unit, y increases by 53.8. Likewise, if x1 is constant and x2 increases by one unit, y increases by 2.84, the coefficient of x2 in the above equation.

Code
53.8/2.84
[1] 18.94366

Using this relationship, x2 must increase by about 18.94 units to have the same effect on y as a one-unit increase in x1. That is, an increase of 18.94 in x2 raises y by 53.8, the same amount as a one-unit increase in x1.
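As a quick check of this equivalence, the sketch below compares the two effects directly. The function name `predict_price` is hypothetical, and the starting point (1240, 18000) is just the house from part (A); any starting point gives the same result because the model is linear.

```r
# Hypothetical helper based on the given prediction equation
predict_price <- function(x1, x2) -10536 + 53.8 * x1 + 2.84 * x2

ratio <- 53.8 / 2.84   # units of x2 equivalent to one unit of x1

base <- predict_price(1240, 18000)
effect_x1 <- predict_price(1240 + 1, 18000) - base      # one more unit of x1
effect_x2 <- predict_price(1240, 18000 + ratio) - base  # 18.94 more units of x2

all.equal(effect_x1, effect_x2)
```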

Question 2

Code
# data loading
library(alr4)
Code
data(salary)
Code
colnames(salary)
[1] "degree" "rank"   "sex"    "year"   "ysdeg"  "salary"

(A)

A t-test is needed to test the hypothesis that the mean salaries of men and women are the same.

Code
t.test(salary~sex, data=salary)

    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14 

In R’s t.test function, the null hypothesis is that there is no difference in mean salary between men and women, and the alternative hypothesis is that the difference is not zero (that is, there is a salary difference between men and women). The p-value (0.09009) is greater than 0.05, so the null hypothesis cannot be rejected at the 0.05 significance level. This does not prove that the salaries are equal; it means that the data do not provide sufficient evidence of a salary difference between men and women.
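The decision rule can be made explicit by pulling the p-value and confidence interval out of the test object. This is a small sketch that assumes the alr4 package is installed; the names `tt` and `alpha` are introduced here for illustration.

```r
library(alr4)
data(salary)

tt <- t.test(salary ~ sex, data = salary)   # Welch two-sample t-test
alpha <- 0.05

tt$p.value           # p-value of the test
tt$conf.int          # 95% CI for the difference in means (Male - Female)
tt$p.value < alpha   # FALSE means the null hypothesis is not rejected
```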

(B)

Code
sal.lm<-lm(salary~., data=salary)
summary(sal.lm)

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

A multiple regression of salary on all the other variables was fitted. According to the fitted equation, salary is 1166.37 higher for females than for males when the other variables are held constant.

Code
confint(sal.lm, "sexFemale", level=0.95)
              2.5 %   97.5 %
sexFemale -697.8183 3030.565

Using the above values, the 95% confidence interval for the salary difference between women and men, controlling for the other variables, is from -697.8 to 3030.6. Since this confidence interval includes 0, the coefficient for sex is not significant at the 5% level of significance.
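The interval reported by confint can be reproduced by hand as estimate ± t × standard error. The sketch below uses the rounded values printed in the regression summary, so the last digits may differ slightly from the confint output; the variable names are chosen here for illustration.

```r
# Values copied from the regression summary above (rounded)
estimate  <- 1166.37
std_error <- 925.57
df_resid  <- 45

t_crit <- qt(0.975, df = df_resid)   # two-sided 95% critical value
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```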

(C)

In summary, the multiple regression equation derived above can be expressed as follows.

Salary = 15746.1 + 1388.6 degreePhD + 5292.4 rankAssoc + 11118.8 rankProf + 1166.4 sexFemale + 476.3 year - 124.6 ysdeg

Prior to the discussion, all significance levels are set at 0.05.

For degree, (a) the salary difference between Masters and PhD is not statistically significant (the p-value is greater than 0.05). (b) According to the coefficient, the salary of a PhD holder is 1388.6 higher than that of a Masters holder. Taken together, when the other variables are the same, a PhD holder is paid 1388.6 more than a Masters holder, but this difference is not statistically significant.

For rank, (a) the salary differences between Assoc and Asst and between Prof and Asst are both statistically significant (both p-values are less than 0.05). (b) According to the coefficients, when the other variables are the same, an associate professor is paid 5292.4 more than an assistant professor, and a professor is paid 11118.8 more than an assistant professor.

For sex, (a) the salary difference between male and female is not statistically significant (the p-value is greater than 0.05). (b) According to the coefficient, when the other variables are held constant, female faculty are paid 1166.4 more than male faculty, but this difference is not statistically significant.

For year, (a) the coefficient of year is statistically significant (the p-value is less than 0.05). (b) According to the coefficient, each additional year in the current rank increases salary by 476.3 when the other variables are held constant.

For ysdeg, (a) the coefficient of ysdeg is not statistically significant (the p-value is greater than 0.05). (b) According to the coefficient, each additional year since the highest degree decreases salary by 124.6 when the other variables are the same, but this effect is not statistically significant.
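Because rank enters the model as two dummy variables, its overall contribution can also be checked with a single partial F-test that compares the full model with one that drops rank. This is a sketch that goes slightly beyond the coefficient-by-coefficient reading above; it assumes the alr4 package is installed, and `full` and `reduced` are names introduced here.

```r
library(alr4)
data(salary)

full    <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)
reduced <- update(full, . ~ . - rank)   # same model without rank

# Partial F-test: does adding rank significantly improve the fit?
rank_test <- anova(reduced, full)
rank_test
```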

(D)

The baseline can be changed using the relevel function. Here, Assoc was set as the baseline instead of Asst, and the multiple regression model was refitted.

Code
salary$rank.a <- relevel(salary$rank, ref = "Assoc")

sal.lm.2<-lm(salary~degree+rank.a+sex+year+ysdeg, data=salary)

summary(sal.lm.2)

Call:
lm(formula = salary ~ degree + rank.a + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21038.41    1109.12  18.969  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rank.aAsst  -5292.36    1145.40  -4.621 3.22e-05 ***
rank.aProf   5826.40    1012.93   5.752 7.28e-07 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
Code
5826.40-(-5292.36)
[1] 11118.76

According to the above coefficients, an assistant professor's salary is 5292.4 lower than an associate professor's, and a professor's salary is 5826.4 higher than an associate professor's. Comparing Prof to Asst, the difference is 11118.8, which is the same as the difference obtained before the baseline was changed. In other words, changing the baseline only changes the form of the multiple regression model; its meaning is the same.
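One way to confirm that the two parameterizations describe the same model is to compare their fitted values. The sketch below assumes the alr4 package is installed; `fit.asst` and `fit.assoc` are names introduced here for the two fits.

```r
library(alr4)
data(salary)

# Baseline Asst (default factor ordering)
fit.asst <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)

# Baseline Assoc
salary$rank.a <- relevel(salary$rank, ref = "Assoc")
fit.assoc <- lm(salary ~ degree + rank.a + sex + year + ysdeg, data = salary)

# Identical fitted values: only the reference category differs
same_fit <- all.equal(fitted(fit.asst), fitted(fit.assoc))
same_fit

# Prof vs. Asst difference recovered from the releveled fit
prof_vs_asst <- unname(coef(fit.assoc)["rank.aProf"] - coef(fit.assoc)["rank.aAsst"])
prof_vs_asst
```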

(E)

The multiple regression model was fitted again, this time excluding the rank variable.

Code
sal.lm.nr<-lm(salary~degree+sex+year+ysdeg, data=salary)
summary(sal.lm.nr)

Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

Several things change. First, holding a PhD now lowers the predicted salary, and this effect is statistically significant (all significance levels are 0.05). The direction of the sex effect has also changed: women’s predicted salary is now lower than men’s, although the difference is not statistically significant. The effect of ysdeg changed sign as well: when rank was included, this variable had a negative effect on salary, but now it has a positive effect, and this result is significant. Finally, year keeps the same direction of influence with a somewhat smaller coefficient (351.97 instead of 476.31), and it remains statistically significant.

(F)

First, a new variable (ysdeg.dean) was created, taking the value “old” when ysdeg exceeds 15 and “new” otherwise.

Code
library(dplyr)

Code
new.sal<-salary %>% mutate(ysdeg.dean = ifelse(ysdeg>15, "old", "new"))

Next, multicollinearity was checked using the vif function.

Code
library(car)
# store the result under a name that does not shadow the vif() function
vif.values<-vif(lm(salary~degree+rank+sex+year+ysdeg.dean, data=new.sal))
vif.values
               GVIF Df GVIF^(1/(2*Df))
degree     1.341872  1        1.158392
rank       2.964200  2        1.312130
sex        1.295820  1        1.138341
year       1.726209  1        1.313853
ysdeg.dean 2.678486  1        1.636608

Here, no value exceeds 5, so no variable appears to have strong multicollinearity. Therefore, I fitted a multiple regression model including all variables and examined the effect of ysdeg reclassified into old and new.

Code
sal.lm.on<-lm(salary~degree+rank+sex+year+ysdeg.dean, data=new.sal)
summary(sal.lm.on)

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg.dean, 
    data = new.sal)

Residuals:
    Min      1Q  Median      3Q     Max 
-3403.3 -1387.0  -167.0   528.2  9233.8 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   15491.84     806.32  19.213  < 2e-16 ***
degreePhD       818.93     797.48   1.027   0.3100    
rankAssoc      4972.66     997.17   4.987 9.61e-06 ***
rankProf      11096.95    1191.00   9.317 4.54e-12 ***
sexFemale       907.14     840.54   1.079   0.2862    
year            434.85      78.89   5.512 1.65e-06 ***
ysdeg.deanold -2163.46    1072.04  -2.018   0.0496 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared:  0.8594,    Adjusted R-squared:  0.8407 
F-statistic: 45.86 on 6 and 45 DF,  p-value: < 2.2e-16

According to the fitted model, with all other variables held constant, faculty in the “old” group (ysdeg greater than 15) are paid 2163.5 less than those in the “new” group. The p-value of this coefficient is 0.0496, so it is statistically significant at the 0.05 level, although only marginally. In other words, the results give some support to the perception that the new dean is more generous with salaries.

Question 3

Code
library(smss)
Code
data(house.selling.price)
colnames(house.selling.price)
[1] "case"  "Taxes" "Beds"  "Baths" "New"   "Price" "Size" 
Code
nrow(house.selling.price)
[1] 100

(A)

A multiple regression model was fitted with Size and New as independent variables and Price as the dependent variable.

Code
lm.3.a<-lm(Price~Size+New, data=house.selling.price)
summary(lm.3.a)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

According to the fitted model, each one-unit increase in size raises the price by 116.1 when New is held constant, and a new home is priced 57736.3 higher than an old home of the same size. Both effects are statistically significant (significance level 0.05).

(B)

For new home: Price = 17505.42 + 116.132Size

For old home: Price = -40230.867 + 116.132Size

This relationship can also be seen in the figure below.

Code
library(ggplot2)
# axis labels are set with labs(); ggplot() itself ignores xlab/ylab arguments
ggplot(house.selling.price, aes(Size, Price, color=as.factor(New))) + 
  geom_point() +
  labs(x="Size", y="Price") +
  scale_color_manual(values = c("red", "blue"), labels = c("Old", "New")) +
  geom_abline(slope=116.132, intercept = -40230.867, color ="red") +
  geom_abline(slope=116.132, intercept = -40230.867+57736.283, color="blue")

In interpretation, the increase in price per unit increase in size is the same (116.132) regardless of whether the house is new or old. However, the intercept for new homes is 17505.42, which is 57736.283 larger than the intercept for old homes. That is, the two regression lines have the same slope, but the line for new homes lies 57736.283 above the line for old homes.

(C)

Code
Size=3000
new.home.Price <- 17505.42 + 116.132*Size
old.home.Price <- -40230.867 + 116.132*Size
new.home.Price
[1] 365901.4
Code
old.home.Price
[1] 308165.1

According to the fitted model, when size is 3000, the predicted selling price is 365901.4 for a new home and 308165.1 for an old home.
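The same predictions can be obtained from the fitted model object instead of typing the coefficients by hand. This is a sketch assuming the smss package is installed; `fit` and `new_obs` are names introduced here. Because the hand calculation uses rounded coefficients, the results may differ from the values above in the last digits.

```r
library(smss)
data(house.selling.price)

fit <- lm(Price ~ Size + New, data = house.selling.price)

# First row: new home (New = 1); second row: old home (New = 0)
new_obs <- data.frame(Size = 3000, New = c(1, 0))
pred <- predict(fit, newdata = new_obs)
pred
```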

(D)

Code
lm.3.d<-lm(Price~Size*New, data=house.selling.price)
summary(lm.3.d)

Call:
lm(formula = Price ~ Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

In the regression model that includes the interaction term, the coefficient of New is negative (-78527.5), so at Size = 0 a new home is predicted to be cheaper than an old one. However, the interaction term between Size and New shows that, for a new home, the price increases by an additional 61.916 for each one-unit increase in size (166.354 in total, compared with 104.438 for an old home). As a result, the predicted price of a new home is lower than that of an old home below a certain size and higher above it.
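The size at which the two predicted prices are equal follows from setting the New effect to zero: -78527.502 + 61.916 × Size = 0. A minimal sketch using the coefficients from the summary above (the variable names are introduced here for illustration):

```r
# Coefficients copied from the interaction model summary above
b_new <- -78527.502   # coefficient of New
b_int <- 61.916       # coefficient of Size:New

# Size at which new and old homes have the same predicted price
break_even <- -b_new / b_int
break_even
```

Above roughly 1268 units of size, the model predicts a higher price for a new home than for an old one.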

(E)

For new home: Price = -100755.3 + 166.354Size

For old home: Price = -22227.808 + 104.438Size

Code
# axis labels are set with labs(); ggplot() itself ignores xlab/ylab arguments
ggplot(house.selling.price, aes(Size, Price, color=as.factor(New))) + 
  geom_point() +
  labs(x="Size", y="Price") +
  scale_color_manual(values = c("red", "blue"), labels = c("Old", "New")) +
  geom_abline(slope=104.438, intercept = -22227.808, color ="red") +
  geom_abline(slope=166.354, intercept = -100755.3, color="blue")

As shown in the figure, in the case of new, the intercept is lower than old, but the slope of new is larger than that of old.

(F)

Code
Size=3000
new.home.Price.2 <- -100755.3 + 166.354*Size
old.home.Price.2 <- -22227.808 + 104.438*Size
new.home.Price.2
[1] 398306.7
Code
old.home.Price.2
[1] 291086.2

The predicted selling price of a new home is 398306.7 and that of an old home is 291086.2.

(G)

Code
Size=1500
new.home.Price.2 <- -100755.3 + 166.354*Size
old.home.Price.2 <- -22227.808 + 104.438*Size
new.home.Price.2
[1] 148775.7
Code
old.home.Price.2
[1] 134429.2

The predicted selling price of a new home is 148775.7 and that of an old home is 134429.2. The price difference is smaller than when the size is 3000. Because the slope for new homes is larger than the slope for old homes, the price of a new home rises faster with size. In other words, as the size increases, the price difference between new and old homes gradually increases, which can be seen from the two regression equations and the graph above.
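The growing gap can be written as a function of size: subtracting the old-home equation from the new-home equation leaves -78527.502 + 61.916 × Size. A small sketch, where `price_gap` is a hypothetical helper name:

```r
# Hypothetical helper: predicted price of a new home minus an old home
# of the same size, from the interaction model coefficients
price_gap <- function(size) -78527.502 + 61.916 * size

gaps <- price_gap(c(1500, 3000))
gaps
```

The two values match the differences between the predictions in (G) and (F), up to rounding.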

(H)

Looking at the R-squared value of each model, the model including the interaction term has the larger value (model with interaction term: 0.7443, model without it: 0.7226). The adjusted R-squared, which corrects for the increase in explained variation that comes from adding variables, is also larger for the model with the interaction term (0.7363 versus 0.7169), so its explanatory power can be said to be higher. However, which model to prefer is not determined by explanatory power alone. In this case, it was easy to compare new and old homes by simply rearranging the equation, but if the number of variables increases or a nominal variable has three or more categories, it becomes increasingly difficult to derive and interpret the regression model. In other words, in selecting models and variables, factors such as interpretability should be considered as well as explanatory power.
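In addition to comparing R-squared values, the two nested models can be compared with a partial F-test and with AIC. This is a sketch of one possible check, not part of the original answer; it assumes the smss package is installed, and `fit.add` and `fit.int` are names introduced here.

```r
library(smss)
data(house.selling.price)

fit.add <- lm(Price ~ Size + New, data = house.selling.price)   # no interaction
fit.int <- lm(Price ~ Size * New, data = house.selling.price)   # with interaction

# Partial F-test for the interaction term
model_test <- anova(fit.add, fit.int)
model_test

# Information criterion for each model (lower is better)
AIC(fit.add, fit.int)
```

With a single added term, the F-test is equivalent to the t-test on Size:New reported in the summary of the interaction model.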

For this case, I prefer the model with the interaction term. It is interesting that, for old houses, the price increase associated with an increase in size is smaller than for new houses; that is, the two variables interact, so the effect of one depends on the value of the other.