HW 4, Diana Rinker

Author

Diana Rinker


DACSS 603, Spring 2023

Loading necessary libraries:

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
library(smss)
data(UN11)

Question 1

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.

A. A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

B. For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

C. According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

A.

x1 <- 1240
x2 <- 18000 
predicted.price <- (-10536) + 53.8*x1 + 2.84*x2
residual <- 145000 - predicted.price   # residual = observed - predicted
table <- cbind(c("prediction", "residual"), c(predicted.price, residual))
knitr::kable(table) 
prediction 107296
residual 37704

The residual is positive, meaning that the model underestimated the house price: the home sold for $37,704 more than the equation predicts.

B.

ŷ = −10,536 + 53.8x1 + 2.84x2. Holding x2 fixed, the predicted selling price increases by $53.80 for each one-square-foot increase in home size. This is because the coefficient of x1 gives the change in ŷ per one-unit change in x1 when all other predictors are held constant.
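As a quick numeric illustration (a minimal sketch using the given equation; the helper function is mine):

# Predicted price from the fitted equation
pred.price <- function(x1, x2) -10536 + 53.8*x1 + 2.84*x2

# With lot size fixed, a one-square-foot increase in home size
# changes the prediction by exactly the x1 coefficient:
pred.price(1241, 18000) - pred.price(1240, 18000)   # 53.8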

C.

To have the same impact as the $53.80 effect of one extra square foot of home size, lot size would need to increase by 53.8/2.84 ≈ 18.94 square feet.

x2.incr<- 53.8/2.84
x2.incr
[1] 18.94366

Question 2

(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.

A.

Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

data(salary)
str(salary)
'data.frame':   52 obs. of  6 variables:
 $ degree: Factor w/ 2 levels "Masters","PhD": 1 1 1 1 2 1 2 1 2 2 ...
 $ rank  : Factor w/ 3 levels "Asst","Assoc",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ sex   : Factor w/ 2 levels "Male","Female": 1 1 1 2 1 1 2 1 1 1 ...
 $ year  : int  25 13 10 7 19 16 0 16 13 13 ...
 $ ysdeg : int  35 22 23 27 30 21 32 18 30 31 ...
 $ salary: int  36350 35350 28200 26775 33696 28516 24900 31909 31850 32850 ...

First, I will make sure that variances in male and female groups are equal:

salary$sex
 [1] Male   Male   Male   Female Male   Male   Female Male   Male   Male  
[11] Male   Male   Male   Male   Male   Male   Male   Male   Male   Male  
[21] Male   Male   Male   Female Male   Male   Male   Female Male   Male  
[31] Female Male   Male   Female Female Male   Female Male   Male   Male  
[41] Male   Male   Male   Female Male   Male   Female Female Male   Female
[51] Female Female
Levels: Male Female
Male <- subset(salary, sex == "Male")
Female<- subset(salary, sex == "Female")
var.test(Male$salary, Female$salary)

    F test to compare two variances

data:  Male$salary and Female$salary
F = 0.84242, num df = 37, denom df = 13, p-value = 0.6525
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.3015275 1.9189474
sample estimates:
ratio of variances 
         0.8424225 

Since p > 0.05, we fail to reject the hypothesis of equal variances: the variances of the two samples can be treated as equal, so a pooled two-sample t-test is appropriate.

t.test(Male$salary, Female$salary, var.equal = TRUE, alternative = "two.sided")

    Two Sample t-test

data:  Male$salary and Female$salary
t = 1.8474, df = 50, p-value = 0.0706
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -291.257 6970.550
sample estimates:
mean of x mean of y 
 24696.79  21357.14 

The p-value of the t-test (0.0706) is above 0.05, so we fail to reject H0: ignoring all other variables, we cannot conclude that mean salaries differ between men and women.
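Though not required here, a quick robustness check is Welch's t-test, which drops the equal-variance assumption (it is R's default when var.equal is not set); with variances this similar it should lead to the same conclusion:

# Welch's t-test: no equal-variance assumption (t.test default, var.equal = FALSE)
t.test(Male$salary, Female$salary)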

B.

Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

summary ( lm(salary ~ degree + rank + sex + year +ysdeg , data = salary ))

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Confidence interval for the difference in salary between males and females:

model <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)
confint(model, "sexFemale", level = 0.95)

The interval is centered on the sexFemale estimate of $1,166.37 and spans zero, consistent with its non-significant p-value (0.214): controlling for the other predictors, we cannot conclude that mean salary differs between males and females.
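The same interval can also be computed by hand from the summary table, as estimate ± t(0.975, 45) × SE:

# 95% CI for sexFemale by hand, using the printed estimate and SE (df = 45)
est <- 1166.37
se  <- 925.57
est + c(-1, 1) * qt(0.975, df = 45) * se   # roughly (-698, 3031) dollars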

C.

Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

summary ( lm(salary ~ degree + rank + sex + year +ysdeg , data = salary ))

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

degree: not statistically significant (p ≈ 0.18), so, holding the other predictors constant, we cannot conclude that a PhD rather than a Masters changes salary.

rank: both non-baseline levels are highly significant (p < 0.001). Relative to the Asst baseline, associate professors earn about $5,292 more and full professors about $11,119 more, other things equal; the Prof effect is roughly twice the Assoc effect.

sex: not statistically significant (p ≈ 0.21), so once rank, degree, and experience are controlled for, the data do not show a salary difference between men and women.

year: highly significant (p < 0.001); each additional year in the current rank is associated with about $476 more salary, holding the other predictors constant. The per-year effect is small relative to the rank effects, but it accumulates over many years.

ysdeg: not statistically significant (p ≈ 0.12), so years since highest degree adds no clear explanatory power once the other variables are in the model.

D.

Change the baseline category for the rank variable. Interpret the coefficients related to rank again.

levels(salary$rank)
[1] "Asst"  "Assoc" "Prof" 
salary$rank <- relevel(salary$rank, ref = "Prof")
summary ( lm(salary ~ degree + rank + sex + year +ysdeg , data = salary ))

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
degreePhD     1388.61    1018.75   1.363    0.180    
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
sexFemale     1166.37     925.57   1.260    0.214    
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

I used the relevel() function to change the baseline category for rank from Asst to Prof. Now we can see that an assistant professor earns on average $11,119 less than a professor, and an associate professor earns $5,826 less, holding the other predictors constant.
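As a consistency check (hand arithmetic on the printed coefficients), the releveled estimates are just differences of the original ones:

# rankAssoc relative to Prof = (Assoc - Asst) - (Prof - Asst), from the first fit
5292.36 - 11118.76   # -5826.4, matching rankAssoc above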

E.

Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.

Exclude the variable rank, refit, and summarize how your findings changed, if they did.

# first, I restore "Asst" as the baseline, because it has the smallest salaries. 
levels(salary$rank)
[1] "Prof"  "Asst"  "Assoc"
salary$rank <- relevel(salary$rank, ref = "Asst")

#excluding rank from the model 
summary ( lm(salary ~ degree + sex + year +ysdeg , data = salary ))

Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

Once I excluded rank from the model, the findings changed in two ways: the sexFemale coefficient flipped sign from +$1,166 to −$1,287 (women are now predicted to earn less), though it remains non-significant (p ≈ 0.33), and ysdeg became strongly significant with a positive sign. This is consistent with Finkelstein's point: if promotion itself is discriminatory, then rank is "tainted", and adjusting for it can mask the salary gap between the sexes.
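A side-by-side look at the sex coefficient with and without rank makes the shift explicit (a short sketch; object names are mine):

# sexFemale coefficient with and without rank in the model
with.rank    <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)
without.rank <- lm(salary ~ degree + sex + year + ysdeg, data = salary)
c(coef(with.rank)["sexFemale"], coef(without.rank)["sexFemale"])
# sign flips from about +1166 to about -1287 once rank is dropped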

F.

Everyone in this dataset was hired the year they earned their highest degree.

So ysdeg also equals time since hire.

It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean.

Faculty with ysdeg ≤ 15 were therefore hired by the new Dean; I will separate them from the rest and compare the two groups.

Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.

Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this.

Select variables carefully to make sure there is no multicollinearity.

Explain why multicollinearity would be a concern in this case and how you avoided it.

Do you find support for the hypothesis that the people hired by the new Dean are making higher salaries than those that were not?

# ysdeg = time since hired
salary<-salary%>%
  mutate (new.hire = ifelse (ysdeg >15, 0, 1))

Since new.hire variable is derived from ysdeg, I will remove ysdeg from the model to avoid multicollinearity:

summary ( lm(salary ~ degree + sex +  rank + year + new.hire, data = salary ))

Call:
lm(formula = salary ~ degree + sex + rank + year + new.hire, 
    data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3403.3 -1387.0  -167.0   528.2  9233.8 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13328.38    1483.38   8.985 1.33e-11 ***
degreePhD     818.93     797.48   1.027   0.3100    
sexFemale     907.14     840.54   1.079   0.2862    
rankProf    11096.95    1191.00   9.317 4.54e-12 ***
rankAssoc    4972.66     997.17   4.987 9.61e-06 ***
year          434.85      78.89   5.512 1.65e-06 ***
new.hire     2163.46    1072.04   2.018   0.0496 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared:  0.8594,    Adjusted R-squared:  0.8407 
F-statistic: 45.86 on 6 and 45 DF,  p-value: < 2.2e-16

The new.hire coefficient is significant at the 5% level (p ≈ 0.0496), supporting the hypothesis that people hired by the new Dean are paid more, holding degree, sex, rank, and year constant.
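Multicollinearity is a concern here because new.hire is constructed directly from ysdeg, so keeping both would make their coefficients unstable and hard to interpret. A quick check uses variance inflation factors from the car package (already loaded via alr4); low GVIF values would confirm the remaining predictors are not collinear, whereas keeping both ysdeg and new.hire would be expected to inflate their VIFs:

# GVIFs for the final model; values near 1 indicate little collinearity
vif(lm(salary ~ degree + sex + rank + year + new.hire, data = salary))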

Question 3

(Data file: house.selling.price in smss R package)

data(house.selling.price)

A.

Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.

str(house.selling.price)
'data.frame':   100 obs. of  7 variables:
 $ case : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Taxes: int  3104 1173 3076 1608 1454 2997 4054 3002 6627 320 ...
 $ Beds : int  4 2 4 3 3 3 3 3 5 3 ...
 $ Baths: int  2 1 2 2 3 2 2 2 4 2 ...
 $ New  : int  0 0 0 0 0 1 0 1 0 0 ...
 $ Price: int  279900 146500 237700 200000 159900 499900 265500 289900 587000 70000 ...
 $ Size : int  2048 912 1654 2068 1477 3153 1355 2075 3990 1160 ...
summary (lm(Price~ Size + New, data = house.selling.price))

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

The model demonstrates statistical significance for both independent variables, with Size having the stronger significance level. The Size coefficient of 116.13 indicates that each additional square foot increases the predicted price by about $116, holding newness constant.

At the same time, the coefficient of the New variable is much bigger in magnitude than that of Size: a new house is predicted to sell for about $57,736 more than an old house of the same size.

B.

Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.

The prediction equation is: Price = −40,231 + 116.13 × Size + 57,736 × New. Setting New = 1 or New = 0 in this equation gives parallel lines for new and not-new homes (see the sketch at the end of this part). To also let the slopes differ, I split the dataset in two by the New variable and fit separate regressions:

new.houses<- house.selling.price%>%
  filter(New==1)
old.houses<- house.selling.price%>%
  filter(New==0)
new.model<- (lm(Price~ Size , data = new.houses))
new.model

Call:
lm(formula = Price ~ Size, data = new.houses)

Coefficients:
(Intercept)         Size  
  -100755.3        166.4  
old.model<- (lm(Price~ Size , data = old.houses))
old.model

Call:
lm(formula = Price ~ Size, data = old.houses)

Coefficients:
(Intercept)         Size  
   -22227.8        104.4  

Model for new houses: Price = −100,755.3 + 166.4 × Size

Model for old houses: Price = −22,227.8 + 104.4 × Size
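Alternatively, the separate equations can be read directly off the single no-interaction fit from part A by substituting New = 0 and New = 1; this yields a common slope with different intercepts, unlike the split-sample fits above (a sketch; the object name m.add is mine):

m.add <- lm(Price ~ Size + New, data = house.selling.price)
b <- coef(m.add)
c(intercept.old = unname(b["(Intercept)"]),
  intercept.new = unname(b["(Intercept)"] + b["New"]),
  slope.both    = unname(b["Size"]))
# old: Price ≈ -40231 + 116.1*Size;  new: Price ≈ 17505 + 116.1*Size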

C. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

New house of size 3000

Size <- 3000
new.house <- -100755.3 + 166.4*Size   # intercept must not be dropped
new.house
[1] 398444.7

Old house of size 3000

old.house <- -22227.8 + 104.4*Size
old.house
[1] 290972.2
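The same predictions can be obtained with predict() on the group fits from part B, which avoids rounding error from the printed coefficients; these exact values reappear in part F:

# Exact fitted values at Size = 3000 from the split-sample models
predict(new.model, newdata = data.frame(Size = 3000))   # ~398,308
predict(old.model, newdata = data.frame(Size = 3000))   # ~291,087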

D. Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results

summary (lm(Price~ Size + New +Size*New, data = house.selling.price))

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

Once we added the interaction term between Size and New, the main effect of New lost significance on its own. Instead, the significant interaction term (p ≈ 0.005) indicates that newness acts through the slope: a new house gains about $62 more per square foot than an old one.
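The implied slopes for each group follow from the interaction coefficients (a sketch; the object name m.int is mine):

m.int <- lm(Price ~ Size * New, data = house.selling.price)
b <- coef(m.int)
c(slope.old = unname(b["Size"]),
  slope.new = unname(b["Size"] + b["Size:New"]))   # ~104.4 vs ~166.4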

E. Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

From the interaction fit, the line for new homes is ŷ = (−22,227.8 − 78,527.5) + (104.44 + 61.92) × Size ≈ −100,755 + 166.4 × Size, and the line for not-new homes is ŷ = −22,227.8 + 104.44 × Size. To visualize them, I add the model's predicted values to the data and plot each group:

model.int <- lm(Price ~ Size + New + Size*New, data = house.selling.price)
house.selling.price <- house.selling.price %>%
  mutate(predicted.price = predict(model.int))

new.houses<- house.selling.price%>%
  filter(New==1)
ggplot(new.houses, mapping=aes(x=Size, y=predicted.price))+
  geom_smooth(method="lm")+
  labs(title=" Predicted selling price to the size for  new homes", x="Size", y="Predicted price")
`geom_smooth()` using formula = 'y ~ x'

old.houses<- house.selling.price%>%
  filter(New==0)
ggplot(old.houses, mapping=aes(x=Size, y=predicted.price))+
  geom_smooth(method="lm")+
  labs(title=" Predicted selling price to the size for old homes", x="Size", y="Predicted price")
`geom_smooth()` using formula = 'y ~ x'

F. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

New:

model <-lm(Price~ Size + New +Size*New, data = house.selling.price)

predicted1.y.new <- predict(model, newdata = data.frame(Size = 3000, New = 1))
predicted1.y.new 
       1 
398307.5 

Old:

predicted1.y.old <- predict(model, newdata = data.frame(Size = 3000, New = 0))
predicted1.y.old 
       1 
291087.4 

G. Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

predicted2.y.new <- predict(model, newdata = data.frame(Size = 1500, New = 1))
predicted2.y.new 
       1 
148776.1 
predicted2.y.old <- predict(model, newdata = data.frame(Size = 1500, New = 0))
predicted2.y.old
       1 
134429.8 

The difference in price between a new and an old house for F (3,000 sq ft) is

dif.1 <- predicted1.y.new -predicted1.y.old 
dif.1
       1 
107220.1 

The difference in price between a new and an old house for G (1,500 sq ft) is

dif.2 <- predicted2.y.new -predicted2.y.old 
dif.2
       1 
14346.32 

The difference between new and old houses of the same square footage grows as size increases.
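This pattern follows algebraically from the interaction fit: the predicted new-minus-old gap at a given size equals β_New + β_Size:New × Size, so it grows by about $62 for every additional square foot. A quick check with the printed coefficients:

# gap(Size) = beta_New + beta_Size:New * Size
-78527.502 + 61.916 * 3000   # ~107,220, matching dif.1
-78527.502 + 61.916 * 1500   # ~14,346, matching dif.2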

H. Do you think the model with interaction or the one without it better represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

The model with the interaction represents the relationship better: the premium for newness grows with house size, a pattern the additive model cannot capture. The interaction term is statistically significant, and the interaction model also has the higher adjusted R-squared (0.736 vs 0.717).
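To make the comparison concrete, one can compare adjusted R-squared values and run a nested-model F-test (a short sketch; the fit names are mine):

m.add <- lm(Price ~ Size + New, data = house.selling.price)
m.int <- lm(Price ~ Size * New, data = house.selling.price)
c(adj.r2.add = summary(m.add)$adj.r.squared,   # ~0.717
  adj.r2.int = summary(m.int)$adj.r.squared)   # ~0.736
anova(m.add, m.int)   # F-test of the interaction term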