hw1
challenge1
my name
dataset
ggplot2
Author

Paritosh G

Published

May 28, 2023

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
Code
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(ggplot2)
library(smss)
library(magrittr)

Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract

Q.1)

Code
# predicted selling price from home size (x1, sq ft) and lot size (x2, sq ft)
y <- function(x1, x2) { 53.8*x1 + 2.84*x2 - 10536 }

a)

Code
predicted <- y(x1 = 1240, x2 = 18000)

print(predicted)
[1] 107296
Code
print( 145000 - predicted)
[1] 37704

The model predicts a selling price of $107296, which is less than the actual selling price of $145000, so the residual (actual minus predicted) is positive: $37704.

b)

  • For each one-square-foot increase in home size, holding lot size fixed, the predicted selling price increases by $53.80.

c)

  • A one-square-foot increase in lot size increases the predicted price by $2.84, holding home size fixed. For lot size to have the same effect as a one-square-foot increase in home size ($53.80), it must increase by 53.8 / 2.84 ≈ 18.94 square feet, as checked below.
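
A quick check of that arithmetic in R:

Code
# lot-size increase (sq ft) with the same price impact as one extra sq ft of home size
53.8 / 2.84   # ≈ 18.94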

Q.2)

Loading the data

Code
data(salary)

a)

Regressing salary on the sex variable, as the question asks:

Code
model_1 <- lm(salary ~ sex, data = salary)
summary(model_1)

Call:
lm(formula = salary ~ sex, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8602.8 -4296.6  -100.8  3513.1 16687.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    24697        938  26.330   <2e-16 ***
sexFemale      -3340       1808  -1.847   0.0706 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared:  0.0639,    Adjusted R-squared:  0.04518 
F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706

The negative coefficient suggests that female faculty are paid about $3340 less on average than their male colleagues. The variable is significant at the 10% level (p = 0.0706) but not at the 5% level.
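
As a quick sanity check (a dplyr sketch; the tidyverse is loaded above): with a single binary predictor, the intercept is the mean male salary and the coefficient is the female-minus-male difference in group means.

Code
# group means of salary by sex; their difference should match the sexFemale coefficient
salary %>%
  group_by(sex) %>%
  summarise(mean_salary = mean(salary))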

b)

A model with the remaining predictors (degree, rank, year, and ysdeg):

Code
model_2 <- lm(salary ~ degree + rank + year + ysdeg, data = salary)
summary(model_2)

Call:
lm(formula = salary ~ degree + rank + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4226.9  -972.1  -293.1   612.5  9840.8 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16311.34     666.84  24.461  < 2e-16 ***
degreePhD    1062.12     991.53   1.071    0.290    
rankAssoc    4713.92    1056.09   4.464 5.18e-05 ***
rankProf    10509.62    1270.43   8.272 1.18e-10 ***
year          416.56      82.75   5.034 7.84e-06 ***
ysdeg         -81.22      69.87  -1.162    0.251    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2414 on 46 degrees of freedom
Multiple R-squared:  0.8499,    Adjusted R-squared:  0.8336 
F-statistic:  52.1 on 5 and 46 DF,  p-value: < 2.2e-16

c)

Code
confint(model_2)
                 2.5 %      97.5 %
(Intercept) 14969.0604 17653.61080
degreePhD    -933.7253  3057.95602
rankAssoc    2588.1320  6839.71680
rankProf     7952.3705 13066.87214
year          250.0004   583.12728
ysdeg        -221.8611    59.42616

The 95% confidence interval for the sex coefficient, taken from the model that also includes sex (fit as M_3 below), runs from roughly $697 less to $3031 more for female faculty relative to male faculty, controlling for the other variables. Note that model_2 above omits sex, so its intervals do not contain this term.
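
For reference, a sketch of how to pull that interval directly from the model that includes sex (the same full model fit as M_3 below):

Code
# 95% confidence interval for the sexFemale coefficient in the full model
confint(lm(salary ~ ., data = salary), parm = "sexFemale")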

  • rank: The model treats rank as a categorical variable, ignoring its ordering. Common practice for ordered categorical variables like rank is either to treat them as regular categorical variables or as numeric. The latter is most defensible when the variable has many levels and/or the gaps between levels can reasonably be treated as equal. Here, with only three levels (one more than a dummy variable), it makes sense to treat rank as a regular categorical variable.

  • degree: The degree variable is insignificant at all conventional levels.

In the full model that also includes sex (fit as M_3 below; see also the releveled summary in part d), the rank coefficients imply that Associate Professors make about $5292 more than Assistant Professors, and full Professors about $11118 more than Assistant Professors.

  • To test the significance of the rank variable as a whole, rather than the individual dummies, we compare two models: one with all variables and one without the rank dummies.
Code
M_3 <- lm(salary ~ ., data = salary)          # full model: all predictors, including sex
M_4 <- lm(salary ~ . - rank, data = salary)   # same model without the rank dummies

anova(M_3, M_4)
Analysis of Variance Table

Model 1: salary ~ degree + rank + sex + year + ysdeg
Model 2: salary ~ (degree + rank + sex + year + ysdeg) - rank
  Res.Df       RSS Df  Sum of Sq     F    Pr(>F)    
1     45 258858365                                  
2     47 658649047 -2 -399790682 34.75 7.485e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The rank variable is highly significant as a whole (F = 34.75, p < 0.001).

  • sex: As the confidence interval already suggested, this variable is not statistically significant at conventional levels once the other predictors are controlled. The coefficient would suggest female faculty make about $1166 more, but interpreting a coefficient whose effect is insignificant is not very meaningful.

  • year: This variable is statistically significant. It suggests that every additional year in current rank is associated with about $476 more salary.

  • ysdeg: This variable is insignificant. The coefficient would suggest that each additional year since the highest degree is associated with about $124 less salary.

d)

Code
# make full Professor the reference category for rank
salary$rank <- relevel(salary$rank, ref = 'Prof')
summary(lm(salary ~ ., data = salary))

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
degreePhD     1388.61    1018.75   1.363    0.180    
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
sexFemale     1166.37     925.57   1.260    0.214    
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

After releveling so that full Professor is the reference category, rankAsst = -11118 means Assistant Professors make about $11118 less than full Professors, controlling for the other variables, and rankAssoc = -5826 means Associate Professors make about $5826 less than full Professors. This is the same information as in the previous model, just parameterized differently.
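
Releveling only changes the reference category, not the underlying fit. A quick sketch to confirm (assuming M_3 from part c is still in memory):

Code
# fitted values are identical whether 'Asst' or 'Prof' is the reference level
all.equal(fitted(M_3), fitted(lm(salary ~ ., data = salary)))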

e)

Code
summary(lm(salary ~ . -rank, data = salary))

Call:
lm(formula = salary ~ . - rank, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

Removing the rank variable makes the sex coefficient negative (female faculty estimated to earn about $1287 less), but it remains statistically insignificant (p = 0.33).

f)

  • As per the question, we create a new indicator variable, new_var, from ysdeg (1 if ysdeg ≤ 15, i.e., the faculty member earned their degree within the last 15 years and so was hired under the new dean; 0 otherwise). We should avoid putting highly correlated predictors in the same model to prevent multicollinearity, and because new_var is derived directly from ysdeg, the two are likely to be strongly correlated, as checked below.
Code
# indicator: 1 if degree earned within the last 15 years (proxy for being hired under the new dean)
salary$new_var <- ifelse(salary$ysdeg <= 15, 1, 0)
cor.test(salary$new_var, salary$ysdeg)

    Pearson's product-moment correlation

data:  salary$new_var and salary$ysdeg
t = -11.101, df = 50, p-value = 4.263e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9074548 -0.7411040
sample estimates:
       cor 
-0.8434239 
  • new_var and ysdeg have a correlation of -0.84, which is very high, so we drop ysdeg and keep new_var.
Code
summary(lm(salary ~ . -ysdeg, data = salary))

Call:
lm(formula = salary ~ . - ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3403.3 -1387.0  -167.0   528.2  9233.8 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24425.32    1107.52  22.054  < 2e-16 ***
degreePhD      818.93     797.48   1.027   0.3100    
rankAsst    -11096.95    1191.00  -9.317 4.54e-12 ***
rankAssoc    -6124.28    1028.58  -5.954 3.65e-07 ***
sexFemale      907.14     840.54   1.079   0.2862    
year           434.85      78.89   5.512 1.65e-06 ***
new_var       2163.46    1072.04   2.018   0.0496 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared:  0.8594,    Adjusted R-squared:  0.8407 
F-statistic: 45.86 on 6 and 45 DF,  p-value: < 2.2e-16
  • At the 5% significance level (p = 0.0496), faculty hired under the new dean earn about $2163 more, controlling for the other variables.

Let’s see what would have happened if we included both variables:

Code
summary(lm(salary ~ . , data = salary))

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3621.2 -1336.8  -271.6   530.1  9247.6 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  25179.14    1901.59  13.241  < 2e-16 ***
degreePhD     1135.00    1031.16   1.101    0.277    
rankAsst    -11411.45    1362.02  -8.378 1.16e-10 ***
rankAssoc    -6177.44    1043.04  -5.923 4.39e-07 ***
sexFemale     1084.09     921.49   1.176    0.246    
year           460.35      95.09   4.841 1.63e-05 ***
ysdeg          -47.86      97.71  -0.490    0.627    
new_var       1749.09    1372.83   1.274    0.209    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2382 on 44 degrees of freedom
Multiple R-squared:  0.8602,    Adjusted R-squared:  0.838 
F-statistic: 38.68 on 7 and 44 DF,  p-value: < 2.2e-16
  • When both ysdeg and new_var are included, neither of them is significant, because of the multicollinearity between them; the variance-inflation check sketched below illustrates this.
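
Variance inflation factors make the problem explicit. A sketch using car::vif() (car is loaded through alr4):

Code
# generalized VIFs for the model containing both ysdeg and new_var;
# inflated values for those two terms reflect their -0.84 correlation
vif(lm(salary ~ ., data = salary))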

Q.3)

Loading the data

Code
data("house.selling.price")

a)

Code
model <- lm(Price ~ Size + New, data = house.selling.price)
summary(model)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

Both Size and New are associated with Price, and both are significant at the 5% level. Controlling for whether the house is new, each additional square foot of size increases the predicted price by about $116. Controlling for size, new houses are on average about $57736 more expensive than older houses.
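
A quick visual check of this relationship (a sketch using ggplot2, loaded above), with a separate least-squares line for new and existing homes:

Code
# scatter of Price vs. Size, coloured by New, with one OLS line per group
ggplot(house.selling.price, aes(x = Size, y = Price, colour = factor(New))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)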

b)

The prediction equation for selling price is: Price = -40230.867 + 116.132*Size + 57736.283*New
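
To see how this equation maps onto the fitted object, here is a sketch evaluating it at an arbitrary illustrative point (Size = 2000, New = 0) and checking against predict():

Code
# evaluate the prediction equation by hand and compare with predict()
b <- coef(model)
unname(b["(Intercept)"] + b["Size"] * 2000 + b["New"] * 0)
predict(model, newdata = data.frame(Size = 2000, New = 0))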

c)

Code
new <- data.frame(Size = 3000, New = 1)

predict(model, newdata = new)
       1 
365900.2 
Code
not_new <- data.frame(Size = 3000, New = 0)

predict(model, newdata = not_new)

The prediction for the house that is not new equals the new-house prediction minus the New coefficient: 365900.2 - 57736.3 ≈ $308164.

d)

Code
model_2 <- lm(Price ~ Size + New + Size * New, data = house.selling.price)
summary(model_2)

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

e)

y = -22227.808 + 104.438*Size - 78527.502*New + 61.916*Size*New

For new homes (New = 1):

y = -22227.808 + 104.438*Size - 78527.502 + 61.916*Size

= -100755.31 + 166.354*Size

For old homes (New = 0):

y = -22227.808 + 104.438*Size
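
The same two lines can be assembled directly from the interaction-model coefficients (a sketch; model_2 is the interaction model fit above):

Code
# slopes and intercepts for new (New = 1) and old (New = 0) homes from model_2
cf <- coef(model_2)
c(intercept_new = unname(cf["(Intercept)"] + cf["New"]),
  slope_new     = unname(cf["Size"] + cf["Size:New"]),
  intercept_old = unname(cf["(Intercept)"]),
  slope_old     = unname(cf["Size"]))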

f)

The predicted selling price, using the model with the interaction term, for a new home of 3000 square feet is below.

Code
predict(model_2, newdata = new)
       1 
398307.5 

The predicted selling price, using the model with the interaction term, for a 3000-square-foot home that is not new is below.

Code
predict(model_2, newdata = not_new)
       1 
291087.4 

g)

The predicted selling price, using the model with the interaction term, for a new home of 1500 square feet is below.

Code
new <- data.frame(Size = 1500, New = 1)

predict(model_2, newdata = new)
       1 
148776.1 

The predicted selling price, using the model with the interaction term, for a 1500-square-foot home that is not new is below.

Code
not_new <- data.frame(Size = 1500, New = 0)

predict(model_2, newdata = not_new)
       1 
134429.8 

A new home of 1500 square feet has a predicted price of $148776.1. An old home of 1500 square feet has a predicted price of $134429.8. The difference is $14346.3.

The difference between new and old home prices is much larger when the home is bigger: for 3000 sq ft homes the difference is 107220.1, as opposed to 14346.3 for homes that are 1500 sq ft. This is consistent with the positive coefficient on the interaction term, as the sketch below also shows.
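
The same comparison in code (a sketch, reusing model_2):

Code
# new-minus-old price difference at 1500 and 3000 square feet
sizes <- data.frame(Size = c(1500, 3000))
predict(model_2, newdata = transform(sizes, New = 1)) -
  predict(model_2, newdata = transform(sizes, New = 0))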

h)

The model with the interaction term has a higher adjusted R-squared (0.7363 vs. 0.7169) and a lower residual standard error (52000 vs. 53880), even though it uses an extra parameter, so it should be preferred.
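
A nested-model F-test gives the same verdict (a sketch; both models are fit above):

Code
# does adding the Size:New interaction significantly improve the fit?
anova(model, model_2)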

Code
summary(model)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16
Code
summary(model_2)

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16