Kristin Abijaoude_HW4

Hw4
Kristin Abijaoude
Published

April 22, 2023

Question 1

ŷ = −10,536 + 53.8x1 + 2.84x2

y = selling price of home (in dollars), x1 = size of home (in square feet), x2 = lot size (in square feet)

A. A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

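Code
price = -10536 + (53.8 * 1240) + (2.84 * 18000)
print(price)
[1] 107296
Code
# residual = observed price - predicted price
residual = 145000 - price
print(residual)
[1] 37704

The predicted selling price is $107,296. The residual is $145,000 - $107,296 = $37,704, so the model underpredicts: this home sold for $37,704 more than the equation predicts.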

B. For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

The predicted selling price increases by $53.80 for each one-square-foot increase in home size, holding lot size fixed, because 53.8 is the coefficient of x1 in the prediction equation.

C. According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

The lot size would need to increase by about 18.94 square feet to have the same impact as a one-square-foot increase in home size, because 53.8 / 2.84 ≈ 18.94.

Code
# ratio of the x1 (home size) coefficient to the x2 (lot size) coefficient

lot = 53.80 / 2.84
print(lot)
[1] 18.94366
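
As a quick check of parts B and C, the sketch below wraps the prediction equation in a small function and compares the two effects directly. The helper name `predict_price` is made up for this illustration and is not part of the assignment.

```r
# Prediction equation from Question 1 (hypothetical helper name).
predict_price <- function(size, lot) {
  -10536 + 53.8 * size + 2.84 * lot
}

base <- predict_price(1240, 18000)

# Part B: one extra square foot of home size, lot size held fixed.
size_effect <- predict_price(1241, 18000) - base

# Part C: extra lot area needed to match that effect, home size held fixed.
lot_needed <- 53.8 / 2.84
lot_effect <- predict_price(1240, 18000 + lot_needed) - base

print(c(size_effect = size_effect, lot_effect = lot_effect))
```

Both effects should come out equal, which is what "the same impact" means in part C.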

Question 2

Code
library(alr4)
Loading required package: car
Loading required package: carData
Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
data(salary, package = "alr4")
head(salary)
   degree rank    sex year ysdeg salary
1 Masters Prof   Male   25    35  36350
2 Masters Prof   Male   13    22  35350
3 Masters Prof   Male   10    23  28200
4 Masters Prof Female    7    27  26775
5     PhD Prof   Male   19    30  33696
6 Masters Prof   Male   16    21  28516

A. Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex.

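With a p-value of about 0.09, we fail to reject the null hypothesis at the 5% significance level: the data do not provide sufficient evidence that mean salary differs between men and women. This is not the same as concluding that the two means are equal.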
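
The test can be reproduced in base R without any extra packages. The sketch below is one way to do it, assuming the salaries are typed in by hand from the full data listing printed in part F and split by sex; the vector names `female`, `male`, and `welch` are illustrative.

```r
# Salaries copied from the data listing in part F, split by sex.
female <- c(26775, 24900, 38045, 25500, 21600, 20690, 22450,
            18304, 17250, 16150, 15350, 16686, 15000, 20300)
male <- c(36350, 35350, 28200, 33696, 28516, 31909, 31850, 32850,
          27025, 24750, 28200, 23712, 25748, 29342, 31114, 24742,
          22906, 24450, 19175, 20525, 27959, 24832, 25400, 24800,
          26182, 23725, 23300, 23713, 20850, 17095, 16700, 17600,
          18075, 18000, 20999, 16500, 16094, 16244)

# Welch two-sample t-test: H0 is that the two group means are equal.
welch <- t.test(male, female, var.equal = FALSE, alternative = "two.sided")
print(welch)
```

With the `salary` data frame loaded, `t.test(salary ~ sex, data = salary)` should run the same Welch test through the formula interface.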

Code
# Welch two-sample t-test comparing mean salary by sex (base R, no extra packages)
sex_test <- t.test(salary ~ sex, data = salary, var.equal = FALSE, alternative = "two.sided")
print(sex_test)

B. Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex.

Code
model <- lm(formula = salary ~ ., data = salary)
summary(model)

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
Code
confint(model, 'sexFemale', level=0.95)
              2.5 %   97.5 %
sexFemale -697.8183 3030.565
Code
confint(model, level=0.95)
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
sexFemale    -697.8183  3030.56452
year          285.1433   667.47476
ysdeg        -280.6397    31.49105
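
The interval for sexFemale can be rebuilt by hand from the coefficient table, which shows where the `confint()` numbers come from. This is a minimal sketch using the rounded estimate and standard error printed by `summary()`, so the last digits may differ slightly.

```r
# Estimate and standard error for sexFemale, read off the summary() table.
estimate <- 1166.37
std_error <- 925.57
df_resid <- 45  # residual degrees of freedom reported by summary()

# 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
print(ci)
```

Because the interval contains zero, the difference between female and male salaries is not significant at the 5% level once the other predictors are held fixed.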

C. Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

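degreePhD - All else being equal, faculty with a PhD are predicted to earn $1,388.61 more than those with a Masters degree. The difference is not statistically significant (p = 0.180).

rankAssoc - Holding the other variables fixed, associate professors are predicted to earn $5,292.36 more than assistant professors (the baseline rank). This is statistically significant (p = 3.22e-05).

rankProf - Full professors are predicted to earn $11,118.76 more than assistant professors, which is also statistically significant (p = 1.62e-10).

sexFemale - Female faculty are predicted to earn $1,166.37 more than male faculty with the same degree, rank, and experience, but the difference is not statistically significant (p = 0.214), and the 95% confidence interval (-$697.82 to $3,030.56) includes zero.

year - Each additional year in current rank is associated with a $476.31 higher predicted salary, which is statistically significant (p = 8.65e-06).

ysdeg - Each additional year since the highest degree was earned is associated with a $124.57 lower predicted salary, but this is not statistically significant (p = 0.115).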
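
The significance of each predictor can be checked from the estimate and standard error alone. The sketch below recomputes the t statistics and two-sided p-values from the rounded values in the coefficient table; the data frame name `coefs` and the 0.05 cutoff are choices made for this illustration.

```r
# Estimates and standard errors copied from the summary(model) table.
coefs <- data.frame(
  term = c("degreePhD", "rankAssoc", "rankProf", "sexFemale", "year", "ysdeg"),
  estimate = c(1388.61, 5292.36, 11118.76, 1166.37, 476.31, -124.57),
  std_error = c(1018.75, 1145.40, 1351.77, 925.57, 94.91, 77.49)
)

# t statistic and two-sided p-value on 45 residual degrees of freedom.
coefs$t_value <- coefs$estimate / coefs$std_error
coefs$p_value <- 2 * pt(-abs(coefs$t_value), df = 45)
coefs$significant <- coefs$p_value < 0.05
print(coefs)
```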

D. Change the baseline category for the rank variable. Interpret the coefficients related to rank again.

Code
salary$rank <- relevel(salary$rank, ref = "Prof")
model2 <- lm(formula = salary ~ ., data = salary)
summary(model2)

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
degreePhD     1388.61    1018.75   1.363    0.180    
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
sexFemale     1166.37     925.57   1.260    0.214    
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

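With full professors as the new baseline, assistant professors are predicted to earn $11,118.76 less and associate professors $5,826.40 less than full professors, holding the other variables fixed. Both differences are statistically significant (p = 1.62e-10 and p = 7.28e-07). Changing the baseline does not change the fit: the other coefficients, the residual standard error, and R-squared are the same as before.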
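
Changing the baseline only re-expresses the same rank differences against a different reference group. As a small sketch, the new rank coefficients can be derived from the first model's coefficients by subtraction; the variable names and the toy factor `rank_toy` are illustrative.

```r
# Rank coefficients from the first model (baseline: Asst).
assoc_vs_asst <- 5292.36
prof_vs_asst <- 11118.76

# The same contrasts expressed against Prof as the baseline.
asst_vs_prof <- -prof_vs_asst
assoc_vs_prof <- assoc_vs_asst - prof_vs_asst
print(c(rankAsst = asst_vs_prof, rankAssoc = assoc_vs_prof))

# relevel() only moves the reference level to the front of the factor.
rank_toy <- factor(c("Asst", "Assoc", "Prof"), levels = c("Asst", "Assoc", "Prof"))
print(levels(relevel(rank_toy, ref = "Prof")))
```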

E. Removing rank from the model

Code
model3 <- lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)
summary(model3)

Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

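With rank removed, the coefficients for degreePhD and sexFemale change sign: PhD holders are now predicted to earn $3,299.35 less than Masters holders (p = 0.015, significant at the 5% level), and female faculty $1,286.54 less than male faculty (p = 0.332, not significant). The coefficient for year falls from $476.31 to $351.97 but remains positive and significant (p = 0.017). The biggest shift is ysdeg, which flips from -$124.57 to $339.40 per year and becomes highly significant (p = 0.0001). The fit is also worse: R-squared drops from 0.855 to 0.631, which suggests that rank explains much of the variation in salary.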
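
One way to judge whether dropping rank matters is a partial F-test between the two nested models. The sketch below approximates it from the residual standard errors and degrees of freedom printed by `summary()`. Because those values are rounded, the result is approximate; `anova(model3, model)` on the fitted objects would give the exact test.

```r
# Residual standard errors and residual df reported by summary().
rse_full <- 2398;    df_full <- 45     # model with rank
rse_reduced <- 3744; df_reduced <- 47  # model3, without rank

# Residual sum of squares = RSE^2 * residual df (approximate: RSE is rounded).
rss_full <- rse_full^2 * df_full
rss_reduced <- rse_reduced^2 * df_reduced

# Partial F statistic for dropping the two rank dummies.
q <- df_reduced - df_full
f_stat <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
p_value <- pf(f_stat, q, df_full, lower.tail = FALSE)
print(c(F = f_stat, p = p_value))
```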

F. New variable, new hypothesis

Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.

Code
# ysdeg15 = 1 if the highest degree was earned within the last 15 years, else 0
salary$ysdeg15 <- ifelse(salary$ysdeg <= 15, 1, 0)
Code
# the six original columns
salary[, 1:6]
    degree  rank    sex year ysdeg salary
1  Masters  Prof   Male   25    35  36350
2  Masters  Prof   Male   13    22  35350
3  Masters  Prof   Male   10    23  28200
4  Masters  Prof Female    7    27  26775
5      PhD  Prof   Male   19    30  33696
6  Masters  Prof   Male   16    21  28516
7      PhD  Prof Female    0    32  24900
8  Masters  Prof   Male   16    18  31909
9      PhD  Prof   Male   13    30  31850
10     PhD  Prof   Male   13    31  32850
11 Masters  Prof   Male   12    22  27025
12 Masters Assoc   Male   15    19  24750
13 Masters  Prof   Male    9    17  28200
14     PhD Assoc   Male    9    27  23712
15 Masters  Prof   Male    9    24  25748
16 Masters  Prof   Male    7    15  29342
17 Masters  Prof   Male   13    20  31114
18     PhD Assoc   Male   11    14  24742
19     PhD Assoc   Male   10    15  22906
20     PhD  Prof   Male    6    21  24450
21     PhD  Asst   Male   16    23  19175
22     PhD Assoc   Male    8    31  20525
23 Masters  Prof   Male    7    13  27959
24 Masters  Prof Female    8    24  38045
25 Masters Assoc   Male    9    12  24832
26 Masters  Prof   Male    5    18  25400
27 Masters Assoc   Male   11    14  24800
28 Masters  Prof Female    5    16  25500
29     PhD Assoc   Male    3     7  26182
30     PhD Assoc   Male    3    17  23725
31     PhD  Asst Female   10    15  21600
32     PhD Assoc   Male   11    31  23300
33     PhD  Asst   Male    9    14  23713
34     PhD Assoc Female    4    33  20690
35     PhD Assoc Female    6    29  22450
36 Masters Assoc   Male    1     9  20850
37 Masters  Asst Female    8    14  18304
38 Masters  Asst   Male    4     4  17095
39 Masters  Asst   Male    4     5  16700
40 Masters  Asst   Male    4     4  17600
41 Masters  Asst   Male    3     4  18075
42     PhD  Asst   Male    3    11  18000
43 Masters Assoc   Male    0     7  20999
44 Masters  Asst Female    3     3  17250
45 Masters  Asst   Male    2     3  16500
46 Masters  Asst   Male    2     1  16094
47 Masters  Asst Female    2     6  16150
48 Masters  Asst Female    2     2  15350
49 Masters  Asst   Male    1     1  16244
50 Masters  Asst Female    1     1  16686
51 Masters  Asst Female    1     1  15000
52 Masters  Asst Female    0     2  20300
Code
# Requires the ysdeg15 indicator (1 if ysdeg <= 15, else 0) as a column of salary.
model4 <- lm(formula = salary ~ degree + sex + year + ysdeg15, data = salary)
Code
summary(model4)
Code
cor.test(salary$ysdeg, salary$ysdeg15)

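I left ysdeg out of this model because the new variable ysdeg15 is derived from it, and keeping both would introduce multicollinearity.

The correlation between ysdeg and ysdeg15 is -0.8434239, a strong negative correlation that supports dropping ysdeg. The correlation does not itself test the claim about the new Dean. That test is the coefficient on ysdeg15 in model4, where a positive and statistically significant estimate would indicate higher salaries for recently hired faculty. No model4 output appears in this report, so the claim is neither supported nor refuted here.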
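
The dummy variable and its correlation with ysdeg can be reproduced in base R. This is a minimal sketch, assuming the ysdeg column is typed in by hand from the data listing above; it does not fit the regression, it only shows how the indicator is built and how closely it tracks ysdeg.

```r
# ysdeg values copied from the data listing above (52 faculty members).
ysdeg <- c(35, 22, 23, 27, 30, 21, 32, 18, 30, 31, 22, 19, 17, 27, 24, 15,
           20, 14, 15, 21, 23, 31, 13, 24, 12, 18, 14, 16, 7, 17, 15, 31,
           14, 33, 29, 9, 14, 4, 5, 4, 4, 11, 7, 3, 3, 1, 6, 2, 1, 1, 1, 2)

# 1 = highest degree earned within the last 15 years (proxy for a recent hire).
ysdeg15 <- ifelse(ysdeg <= 15, 1, 0)
print(table(ysdeg15))

# Correlation between the original variable and its dummy.
r <- cor(ysdeg, ysdeg15)
print(r)
```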

Question 3

Code
library(smss)
data("house.selling.price", package = "smss")
head(house.selling.price)
  case Taxes Beds Baths New  Price Size
1    1  3104    4     2   0 279900 2048
2    2  1173    2     1   0 146500  912
3    3  3076    4     2   0 237700 1654
4    4  1608    3     2   0 200000 2068
5    5  1454    3     3   0 159900 1477
6    6  2997    3     2   1 499900 3153

A. Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no).

Code
selling <- lm(formula = Price ~ Size + New, data = house.selling.price)
summary(selling)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

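Holding size fixed, a new home is predicted to sell for $57,736.28 more than a home that is not new (p = 0.003). Holding newness fixed, each additional square foot adds $116.13 to the predicted selling price (p < 2e-16). Together the two predictors explain about 72% of the variation in price (R-squared = 0.7226).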

B. Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.

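ŷ = −40,230.87 + 116.13x1 + 57,736.28x2, where x1 = size of home (in square feet) and x2 = 1 if the home is new, 0 otherwise.

New homes (x2 = 1): ŷ = 17,505.42 + 116.13x1

Not new homes (x2 = 0): ŷ = −40,230.87 + 116.13x1

The two lines are parallel: each additional square foot adds $116.13 to the predicted price for both groups, and a new home is predicted to sell for $57,736.28 more than a not new home of the same size.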

C. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

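A home of 3000 square feet is predicted to sell for about $365,901 if it is new and about $308,165 if it is not new.

Code
# (i) new
price_new = -40230.867 + (116.132 * 3000) + (57736.283 * 1)
print(price_new)
[1] 365901.4
Code
# (ii) not new
price1 = -40230.867 + (116.132 * 3000) + (57736.283 * 0)
print(price1)
[1] 308165.1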
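
Parts B and C can be checked together with a small helper built from the coefficient table. The sketch below uses the coefficients as printed by `summary(selling)`, so its results can differ slightly from hand calculations done with more heavily rounded values. The name `predict_additive` is made up for this illustration.

```r
# Coefficients from summary(selling): Price ~ Size + New.
b0 <- -40230.867
b_size <- 116.132
b_new <- 57736.283

# Hypothetical helper: predicted price for a given size and new indicator (0/1).
predict_additive <- function(size, new) b0 + b_size * size + b_new * new

# Intercepts of the two parallel lines (same slope, different intercept).
intercepts <- c(not_new = b0, new = b0 + b_new)
print(intercepts)

# Predictions for a 3000-square-foot home.
print(c(new = predict_additive(3000, 1), not_new = predict_additive(3000, 0)))
```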

D. Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results

Code
selling2 <- lm(formula = Price ~ Size*New, data = house.selling.price)
summary(selling2)

Call:
lm(formula = Price ~ Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

E. Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

New: ŷ = (−22,227.81 − 78,527.50) + (104.44 + 61.92) × Size = −100,755.31 + 166.35 × Size

Not new: ŷ = −22,227.81 + 104.44 × Size

Code
# new: the interaction term is 61.916 * Size * New
price2 = -22227.808 + (104.438 * 3000) + (-78527.502 * 1) + (61.916 * 3000 * 1)
price2
[1] 398306.7
Code
# not new: the New and Size:New terms are both zero
price3 = -22227.808 + (104.438 * 3000)
price3
[1] 291086.2

F. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

A new house would sell at $398,306.69, while a not new house would sell at $291,086.19.
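
In the interaction model the Size:New coefficient multiplies both Size and New, so it changes the slope for new homes rather than adding a constant. A minimal sketch with the coefficients printed by `summary(selling2)` follows; the names `predict_interaction` and `group_lines` are illustrative.

```r
# Coefficients from summary(selling2): Price ~ Size * New.
b0 <- -22227.808
b_size <- 104.438
b_new <- -78527.502
b_int <- 61.916   # Size:New interaction

# Hypothetical helper; note the interaction term is b_int * size * new.
predict_interaction <- function(size, new) {
  b0 + b_size * size + b_new * new + b_int * size * new
}

# Implied line for each group: intercept and slope on Size.
group_lines <- rbind(not_new = c(intercept = b0, slope = b_size),
                     new = c(intercept = b0 + b_new, slope = b_size + b_int))
print(group_lines)
print(c(new = predict_interaction(3000, 1), not_new = predict_interaction(3000, 0)))
```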

G. Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

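Code
# new (interaction model)
price4 = -22227.808 + (104.438 * 1500) + (-78527.502 * 1) + (61.916 * 1500 * 1)
price4
[1] 148775.7
Code
# not new (interaction model)
price5 = -22227.808 + (104.438 * 1500)
price5
[1] 134429.2

Using the interaction model, a new 1500-square-foot house is predicted to sell at $148,775.69 and a not new one at $134,429.19, a difference of $14,346.50. At 3000 square feet the same model gives a difference of -78,527.502 + 61.916 × 3000 = $107,220.50. The predicted premium for a new home therefore grows with size, by $61.92 for each additional square foot.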
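
The gap between new and not new homes in the interaction model depends only on the New and Size:New coefficients, so it can be written as a function of size. The sketch below also solves for the size at which the two fitted lines cross; the names `price_gap` and `break_even` are made up for this illustration.

```r
# New and Size:New coefficients from summary(selling2).
b_new <- -78527.502
b_int <- 61.916

# Predicted price gap (new minus not new) at a given size.
price_gap <- function(size) b_new + b_int * size

print(price_gap(c(1500, 3000)))

# Size at which the two lines cross (gap equals zero).
break_even <- -b_new / b_int
print(break_even)
```

Below the break-even size the model predicts a lower price for a new home than for a not new one, which is an extrapolation worth treating with caution.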

H. Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

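I prefer the model without the interaction because it is simpler and easier to interpret, and both of its predictors are individually significant, whereas in the interaction model the main effect of New is no longer significant (p = 0.127). The fit statistics are similar: the residual standard error is 53,880 versus 52,000, R-squared is 0.7226 versus 0.7443, and adjusted R-squared is 0.7169 versus 0.7363. That said, the interaction term itself is significant (p = 0.005) and the adjusted R-squared is slightly higher with it, so the interaction model fits the data somewhat better.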
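
Adjusted R-squared penalizes the extra interaction term, so it is a fairer basis for comparing the two models than R-squared alone. The sketch below recomputes it from the reported R-squared values, taking n = 100 from the residual degrees of freedom; the function name `adj_r2` is illustrative.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 100  # 97 residual df + 3 estimated coefficients in the additive model
additive <- adj_r2(0.7226, n, p = 2)
with_interaction <- adj_r2(0.7443, n, p = 3)
print(round(c(additive = additive, with_interaction = with_interaction), 4))
```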