library(tidyverse)
library(smss)
library(alr4)
knitr::opts_chunk$set(echo = TRUE)
Megha Joseph
November 14, 2022
#A
[1] 107296
The predicted selling price is 107,296 dollars and the actual selling price is 145,000 dollars. The residual is 37,704 dollars, meaning that the house was sold for 37,704 dollars greater than predicted.
For a fixed lot size, the predicted selling price increases by 53.8 dollars for each additional square foot of home size. With lot size held fixed, the term 2.84x2 is a constant in the prediction equation, so it does not contribute to any change in the prediction. The only remaining change comes from the home-size coefficient, 53.8: increasing x1 by 1 square foot raises the prediction by 53.8*1 = 53.8 dollars.
#C
[1] 18.94366
The lot size would need to increase by about 18.94 square feet in order to have an equivalent impact as an additional square foot of home size.
degree rank sex year ysdeg salary
1 Masters Prof Male 25 35 36350
2 Masters Prof Male 13 22 35350
3 Masters Prof Male 10 23 28200
4 Masters Prof Female 7 27 26775
5 PhD Prof Male 19 30 33696
6 Masters Prof Male 16 21 28516
Call:
lm(formula = salary ~ sex, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8602.8 -4296.6 -100.8 3513.1 16687.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24697 938 26.330 <2e-16 ***
sexFemale -3340 1808 -1.847 0.0706 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared: 0.0639, Adjusted R-squared: 0.04518
F-statistic: 3.413 on 1 and 50 DF, p-value: 0.0706
The coefficient for female is -3340, meaning that women are estimated to earn 3,340 dollars less than men on average when no other variable is taken into account. However, the p-value is 0.0706, so at the 0.05 significance level we fail to reject the null hypothesis and cannot conclude that mean salaries differ between men and women.
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
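The 95% confidence interval for the difference in mean salary between females and males (female minus male), controlling for the other variables, is (-697.82, 3030.56) dollars. Because the interval contains 0, the difference is not statistically significant at the 0.05 level.
Rank and year are significant predictors of salary at the 0.05 level, while degree, sex, and ysdeg are not.
Both rank and year positively predict salary: Associate Professors and full Professors are predicted to earn considerably more than Assistant Professors, and faculty with more years in their current rank also earn more.
A change in rank shifts predicted salary far more than one additional year in rank (5,292 and 11,119 dollars versus 476 dollars), although the two coefficients are measured in different units.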
Call:
lm(formula = salary ~ rank, data = salary)
Residuals:
Min 1Q Median 3Q Max
-5209.0 -1819.2 -417.8 1586.6 8386.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29659.0 669.3 44.316 < 2e-16 ***
rankAsst -11890.3 972.4 -12.228 < 2e-16 ***
rankAssoc -6483.0 1043.0 -6.216 1.09e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2993 on 49 degrees of freedom
Multiple R-squared: 0.7542, Adjusted R-squared: 0.7442
F-statistic: 75.17 on 2 and 49 DF, p-value: 1.174e-15
After releveling rank so that Professor is the baseline category, the model estimates that Assistant Professors earn 11,890.3 dollars less and Associate Professors earn 6,483.0 dollars less than full Professors. Both coefficients have p-values well below 0.05, so rank has a statistically significant effect on salary.
Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 2e-16 ***
degreePhD -3299.35 1302.52 -2.533 0.014704 *
sexFemale -1286.54 1313.09 -0.980 0.332209
year 351.97 142.48 2.470 0.017185 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
When rank is removed, the coefficient for sex becomes -1286.54, compared with 1166.37 when rank is included. Without rank in the model, female faculty are therefore predicted to earn 1,286.54 dollars less than male faculty. However, the p-value is 0.332, so the difference is not statistically significant. While the sign change upon removing rank is interesting, the high p-value would likely prevent these results from holding up in court as evidence of discrimination on the basis of sex.
Call:
lm(formula = salary ~ dean_hired, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8294 -3486 -1772 3829 10576
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27469.4 913.4 30.073 < 2e-16 ***
dean_hired1 -7343.5 1291.8 -5.685 6.73e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4658 on 50 degrees of freedom
Multiple R-squared: 0.3926, Adjusted R-squared: 0.3804
F-statistic: 32.32 on 1 and 50 DF, p-value: 6.734e-07
Call:
lm(formula = salary ~ sex + rank + degree + dean_hired, data = salary)
Residuals:
Min 1Q Median 3Q Max
-6187.5 -1750.9 -438.9 1719.5 9362.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29511.3 784.0 37.640 < 2e-16 ***
sexFemale -829.2 997.6 -0.831 0.410
rankAsst -11925.7 1512.4 -7.885 4.37e-10 ***
rankAssoc -7100.4 1297.0 -5.474 1.76e-06 ***
degreePhD 1126.2 1018.4 1.106 0.275
dean_hired1 319.0 1303.8 0.245 0.808
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3023 on 46 degrees of freedom
Multiple R-squared: 0.7645, Adjusted R-squared: 0.7389
F-statistic: 29.87 on 5 and 46 DF, p-value: 2.192e-13
I created an indicator variable named “dean_hired” that equals 1 for faculty whose highest degree was earned 15 or fewer years ago (ysdeg <= 15), taken here as hired by the current Dean, and 0 otherwise. Using this variable, I fitted a regression model along with sex, rank, and degree. To avoid multicollinearity, I left out year and ysdeg; dean_hired is derived directly from ysdeg, so ysdeg does not need to be included as well.
Based on the regression model, those hired by the current Dean are predicted to earn 319 dollars more than those not hired by the Dean, holding sex, rank, and degree fixed, which is a small difference. Furthermore, the p-value for dean_hired is 0.808, which indicates that the relationship between dean_hired and salary is not statistically significant. Based on these results, the findings do not indicate any favorable treatment by the Dean toward faculty that the Dean specifically hired.
case Taxes Beds Baths New Price Size
1 1 3104 4 2 0 279900 2048
2 2 1173 2 1 0 146500 912
3 3 3076 4 2 0 237700 1654
4 4 1608 3 2 0 200000 2068
5 5 1454 3 3 0 159900 1477
6 6 2997 3 2 1 499900 3153
7 7 4054 3 2 0 265500 1355
8 8 3002 3 2 1 289900 2075
9 9 6627 5 4 0 587000 3990
10 10 320 3 2 0 70000 1160
11 11 630 3 2 0 64500 1220
12 12 1780 3 2 0 167000 1690
13 13 1630 3 2 0 114600 1380
14 14 1530 3 2 0 103000 1590
15 15 930 3 1 0 101000 1050
16 16 590 2 1 0 70000 770
17 17 1050 3 2 0 85000 1410
18 18 20 3 1 0 22500 1060
19 19 870 2 2 0 90000 1300
20 20 1320 3 2 0 133000 1500
21 21 1350 2 1 0 90500 820
22 22 5616 4 3 1 577500 3949
23 23 680 2 1 0 142500 1170
24 24 1840 3 2 0 160000 1500
25 25 3680 4 2 0 240000 2790
26 26 1660 3 1 0 87000 1030
27 27 1620 3 2 0 118600 1250
28 28 3100 3 2 0 140000 1760
29 29 2070 2 3 0 148000 1550
30 30 830 3 2 0 69000 1120
31 31 2260 4 2 0 176000 2000
32 32 1760 3 1 0 86500 1350
33 33 2750 3 2 1 180000 1840
34 34 2020 4 2 0 179000 2510
35 35 4900 3 3 1 338000 3110
36 36 1180 4 2 0 130000 1760
37 37 2150 3 2 0 163000 1710
38 38 1600 2 1 0 125000 1110
39 39 1970 3 2 0 100000 1360
40 40 2060 3 1 0 100000 1250
41 41 1980 3 1 0 100000 1250
42 42 1510 3 2 0 146500 1480
43 43 1710 3 2 0 144900 1520
44 44 1590 3 2 0 183000 2020
45 45 1230 3 2 0 69900 1010
46 46 1510 2 2 0 60000 1640
47 47 1450 2 2 0 127000 940
48 48 970 3 2 0 86000 1580
49 49 150 2 2 0 50000 860
50 50 1470 3 2 0 137000 1420
51 51 1850 3 2 0 121300 1270
52 52 820 2 1 0 81000 980
53 53 2050 4 2 0 188000 2300
54 54 710 3 2 0 85000 1430
55 55 1280 3 2 0 137000 1380
56 56 1360 3 2 0 145000 1240
57 57 830 3 2 0 69000 1120
58 58 800 3 2 0 109300 1120
59 59 1220 3 2 0 131500 1900
60 60 3360 4 3 0 200000 2430
61 61 210 3 2 0 81900 1080
62 62 380 2 1 0 91200 1350
63 63 1920 4 3 0 124500 1720
64 64 4350 3 3 0 225000 4050
65 65 1510 3 2 0 136500 1500
66 66 4154 3 3 0 381000 2581
67 67 1976 3 2 1 250000 2120
68 68 3605 3 3 1 354900 2745
69 69 1400 3 2 0 140000 1520
70 70 790 2 2 0 89900 1280
71 71 1210 3 2 0 137000 1620
72 72 1550 3 2 0 103000 1520
73 73 2800 3 2 0 183000 2030
74 74 2560 3 2 0 140000 1390
75 75 1390 4 2 0 160000 1880
76 76 5443 3 2 0 434000 2891
77 77 2850 2 1 0 130000 1340
78 78 2230 2 2 0 123000 940
79 79 20 2 1 0 21000 580
80 80 1510 4 2 0 85000 1410
81 81 710 3 2 0 69900 1150
82 82 1540 3 2 0 125000 1380
83 83 1780 3 2 1 162600 1470
84 84 2920 2 2 1 156900 1590
85 85 1710 3 2 1 105900 1200
86 86 1880 3 2 0 167500 1920
87 87 1680 3 2 0 151800 2150
88 88 3690 5 3 0 118300 2200
89 89 900 2 2 0 94300 860
90 90 560 3 1 0 93900 1230
91 91 2040 4 2 0 165000 1140
92 92 4390 4 3 1 285000 2650
93 93 690 3 1 0 45000 1060
94 94 2100 3 2 0 124900 1770
95 95 2880 4 2 0 147000 1860
96 96 990 2 2 0 176000 1060
97 97 3030 3 2 0 196500 1730
98 98 1580 3 2 0 132200 1370
99 99 1770 3 2 0 88400 1560
100 100 1430 3 2 0 127200 1340
#A
Call:
lm(formula = Price ~ Size + New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
Size 116.132 8.795 13.204 < 2e-16 ***
New 57736.283 18653.041 3.095 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
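Both Size and New positively predict selling price. Holding New fixed, each additional square foot of size increases the predicted price by 116.13 dollars; holding Size fixed, a new home is predicted to sell for 57,736.28 dollars more than an old home.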
#B
Call:
lm(formula = Price ~ Size, data = new_home)
Residuals:
Min 1Q Median 3Q Max
-78606 -16092 -987 20068 76140
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -100755.31 42513.73 -2.370 0.0419 *
Size 166.35 17.09 9.735 4.47e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 45500 on 9 degrees of freedom
Multiple R-squared: 0.9133, Adjusted R-squared: 0.9036
F-statistic: 94.76 on 1 and 9 DF, p-value: 4.474e-06
Call:
lm(formula = Price ~ Size, data = old_home)
Residuals:
Min 1Q Median 3Q Max
-175748 -29155 -7297 14159 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15708.186 -1.415 0.161
Size 104.438 9.538 10.950 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52620 on 87 degrees of freedom
Multiple R-squared: 0.5795, Adjusted R-squared: 0.5747
F-statistic: 119.9 on 1 and 87 DF, p-value: < 2.2e-16
In the models fitted separately to new and old homes, Size positively predicts price in both, with a larger slope for new homes (166.35 versus 104.44 dollars per square foot). The adjusted R-squared is also much higher for new homes than for old homes (0.90 versus 0.57).
New_Price = 166.35 * Size - 100755.31
Old_Price = 104.438 * Size - 22227.808
#C
[1] "New Price = 398294.690000"
[1] "Old Price = 291086.192000"
#D
Call:
lm(formula = Price ~ Size * New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 2e-16 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
#E
The predicted selling price, based on the new regression that includes interaction between Size and Newness, would look like:
price_with_sizeAndNew = ( -22227.81 + 104.44 * Size ) + ( -78527.50 + 61.92 * Size )
price_with_size = -22227.81 + 104.44 * Size
#F
[1] "New Price = 398324.690000"
[1] "Old Price = 291092.190000"
#G
[1] "New Price = 148784.690000"
[1] "Old Price = 134432.190000"
As the size of the home goes up, the difference in predicted selling price between new and old homes becomes larger: 14,352.50 dollars at 1,500 square feet versus 107,232.50 dollars at 3,000 square feet.
#H
When we add the interaction between Size and New, the coefficient for New becomes large and negative (-78,527.50), although it is not statistically significant (p = 0.127). The adjusted R-squared for the interaction model is 0.7363, and the adjusted R-squared for the first model, with Size and New but no interaction, is 0.7169. The increase in adjusted R-squared indicates a slightly better fit even after accounting for the additional term. Although the two models have similar adjusted R-squared values, I prefer the model with the interaction because the interaction term is statistically significant (p = 0.005), so the prediction equation should allow the effect of Size to differ between new and old homes.
---
title: "HOME WORK 4"
author: "Megha Joseph"
description: "HW4"
date: "11/14/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw4
  - megha joseph
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(smss)
library(alr4)
knitr::opts_chunk$set(echo = TRUE)
```
## Question Answer 1
# A
```{r}
# Predicted selling price
# a = home size (square feet), b = lot size (square feet)
SP <- function(a, b) {
  -10536 + 53.8 * a + 2.84 * b
}
SP(1240, 18000)
```
```{r}
# Residual
# r = actual selling price, p = predicted selling price
RSL <- function(r, p) {
  r - p
}
RSL(145000, 107296)
```
The predicted selling price is 107,296 dollars and the actual selling price is 145,000 dollars.
The residual is 37,704 dollars, meaning that the house was sold for 37,704 dollars greater than predicted.
# B
For a fixed lot size, the predicted selling price increases by 53.8 dollars for each additional square foot of home size. With lot size held fixed, the term 2.84x2 is a constant in the prediction equation, so it does not contribute to any change in the prediction. The only remaining change comes from the home-size coefficient, 53.8: increasing x1 by 1 square foot raises the prediction by 53.8*1 = 53.8 dollars.
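This interpretation can be checked numerically. The sketch below restates the prediction equation from part A under the hypothetical name `predict_price()` and reuses the 18,000 square foot lot from part A purely as an illustrative fixed value.

```{r}
# Prediction equation from part A (helper name is hypothetical)
predict_price <- function(home_size, lot_size) {
  -10536 + 53.8 * home_size + 2.84 * lot_size
}

# Hold lot size fixed and add one square foot of home size
price_change <- predict_price(1241, 18000) - predict_price(1240, 18000)
price_change
```

The difference should equal the home-size coefficient, 53.8, up to floating-point rounding, regardless of which fixed lot size is chosen.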
# C
```{r}
# Calculating lot size needed for equal impact of 1 unit increase in home size
# 53.8(1) = 2.84x2
x2 <- 53.8/2.84
x2
```
The lot size would need to increase by about 18.94 square feet in order to have an equivalent impact as an additional square foot of home size.
## Question Answer 2
# A
```{r}
# Load data and preview
data(salary)
head(salary)
# Simple regression of salary on sex
GR <- lm(salary ~ sex, data = salary)
summary(GR)
```
The coefficient for female is -3340, meaning that women are estimated to earn 3,340 dollars less than men on average when no other variable is taken into account. However, the p-value is 0.0706, so at the 0.05 significance level we fail to reject the null hypothesis and cannot conclude that mean salaries differ between men and women.
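The reported p-value can be recovered from the t statistic and the residual degrees of freedom. This is a minimal sketch that copies the rounded values from the summary output rather than refitting the model; the variable names are hypothetical.

```{r}
# Values copied from the summary(GR) output (rounded as printed)
t_value <- -1.847
df_resid <- 50

# Two-sided p-value for the sexFemale coefficient
p_value <- 2 * pt(-abs(t_value), df = df_resid)
p_value
```

The result should be close to 0.0706, which is above the 0.05 threshold.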
# B
```{r}
female <- summary(lm(salary ~ ., data = salary))
female
```
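The 95% confidence interval for the difference in mean salary between females and males (female minus male), controlling for the other variables, is (-697.82, 3030.56) dollars. Because the interval contains 0, the difference is not statistically significant at the 0.05 level.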
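The interval can be reproduced by hand from the `sexFemale` row of the summary output. The sketch below assumes the printed estimate, standard error, and residual degrees of freedom; the variable names are hypothetical. Calling `confint()` on the fitted `lm` object is the more direct route.

```{r}
# Values copied from the sexFemale row of the full-model summary
estimate <- 1166.37
std_error <- 925.57
df_resid <- 45

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```

The two endpoints should match the interval quoted above up to rounding of the printed coefficients.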
# C
Rank and year are significant predictors of salary at the 0.05 level, while degree, sex, and ysdeg are not.
Both rank and year positively predict salary: Associate Professors and full Professors are predicted to earn considerably more than Assistant Professors, and faculty with more years in their current rank also earn more.
A change in rank shifts predicted salary far more than one additional year in rank (5,292 and 11,119 dollars versus 476 dollars), although the two coefficients are measured in different units.
# D
```{r}
salary$rank <- relevel(salary$rank, ref = "Prof")
sal_rank <- lm(salary ~ rank, salary)
summary(sal_rank)
```
After releveling rank so that Professor is the baseline category, the model estimates that Assistant Professors earn 11,890.3 dollars less and Associate Professors earn 6,483.0 dollars less than full Professors. Both coefficients have p-values well below 0.05, so rank has a statistically significant effect on salary.
# E
```{r}
summary(lm(salary ~ degree + sex + year + ysdeg, salary))
```
When rank is removed, the coefficient for sex becomes -1286.54, compared with 1166.37 when rank is included. Without rank in the model, female faculty are therefore predicted to earn 1,286.54 dollars less than male faculty. However, the p-value is 0.332, so the difference is not statistically significant. While the sign change upon removing rank is interesting, the high p-value would likely prevent these results from holding up in court as evidence of discrimination on the basis of sex.
# F
```{r}
# Indicator: "1" if the highest degree was earned 15 or fewer years ago
salary <- salary %>%
  mutate(dean_hired = case_when(ysdeg <= 15 ~ "1", ysdeg > 15 ~ "0"))
summary(lm(salary ~ dean_hired, data = salary))
```
```{r}
summary(lm(salary ~ sex + rank + degree + dean_hired, data = salary))
```
I created an indicator variable named "dean_hired" that equals 1 for faculty whose highest degree was earned 15 or fewer years ago (ysdeg <= 15), taken here as hired by the current Dean, and 0 otherwise. Using this variable, I fitted a regression model along with sex, rank, and degree. To avoid multicollinearity, I left out year and ysdeg; dean_hired is derived directly from ysdeg, so ysdeg does not need to be included as well.
Based on the regression model, those hired by the current Dean are predicted to earn 319 dollars more than those not hired by the Dean, holding sex, rank, and degree fixed, which is a small difference. Furthermore, the p-value for dean_hired is 0.808, which indicates that the relationship between dean_hired and salary is not statistically significant. Based on these results, the findings do not indicate any favorable treatment by the Dean toward faculty that the Dean specifically hired.
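To make the coding rule concrete, here is a small base-R sketch of the same cutoff applied to a made-up vector. The values in `ysdeg_example` are hypothetical and are not taken from the salary data.

```{r}
# Hypothetical years-since-degree values, for illustration only
ysdeg_example <- c(3, 15, 16, 30)

# 1 = degree earned 15 or fewer years ago, 0 = more than 15 years ago
dean_hired_example <- factor(ifelse(ysdeg_example <= 15, 1, 0), levels = c(0, 1))
dean_hired_example
```

Note that a value of exactly 15 falls in the "1" group, matching the `ysdeg <= 15` condition used in the model above.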
## Question Answer 3
```{r}
data("house.selling.price")
house.selling.price
```
# A
```{r}
summary(lm(Price ~ Size + New, data = house.selling.price))
```
Both Size and New positively predict selling price. Holding New fixed, each additional square foot of size increases the predicted price by 116.13 dollars; holding Size fixed, a new home is predicted to sell for 57,736.28 dollars more than an old home.
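Each coefficient in this additive model is a difference in predicted price when one variable changes and the other is held fixed. The sketch below illustrates this with the printed coefficients; the function name `predict_additive()` and the 2,000 square foot reference size are assumptions chosen for illustration.

```{r}
# Coefficients copied from the Price ~ Size + New summary output
b0 <- -40230.867
b_size <- 116.132
b_new <- 57736.283

# Hypothetical helper: predicted price for a given size and New indicator
predict_additive <- function(size, new) {
  b0 + b_size * size + b_new * new
}

# Same size, new versus old: the gap is the New coefficient
new_gap <- predict_additive(2000, 1) - predict_additive(2000, 0)
# Same age category, one extra square foot: the gap is the Size coefficient
size_gap <- predict_additive(2001, 0) - predict_additive(2000, 0)
c(new_gap = new_gap, size_gap = size_gap)
```

Because the model has no interaction term, both gaps are the same at every reference size.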
# B
```{r}
new_home <- house.selling.price %>% filter(New == 1)
summary(lm(Price ~ Size, data = new_home))
```
```{r}
old_home <- house.selling.price %>% filter(New == 0)
summary(lm(Price ~ Size, data = old_home))
```
In the models fitted separately to new and old homes, Size positively predicts price in both, with a larger slope for new homes (166.35 versus 104.44 dollars per square foot). The adjusted R-squared is also much higher for new homes than for old homes (0.90 versus 0.57).
New_Price = 166.35 * Size - 100755.31
Old_Price = 104.438 * Size - 22227.808
# C
```{r}
Size <- 3000
New_Price = 166.35 * Size - 100755.31
Old_Price = 104.438 * Size - 22227.808
sprintf("New Price = %f", New_Price)
sprintf("Old Price = %f", Old_Price)
```
# D
```{r}
summary(lm(Price ~ Size*New, data = house.selling.price))
```
# E
The predicted selling price, based on the new regression that includes interaction between Size and Newness, would look like:
price_with_sizeAndNew = ( -22227.81 + 104.44 * Size ) + ( -78527.50 + 61.92 * Size )
price_with_size = -22227.81 + 104.44 * Size
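As a consistency check, the new-home line implied by the interaction model can be compared with the separate new-home fit from part B. This is a minimal sketch using the unrounded coefficients printed in the part D output; the variable names are hypothetical.

```{r}
# Coefficients copied from the Price ~ Size * New summary output
b0 <- -22227.808
b_size <- 104.438
b_new <- -78527.502
b_int <- 61.916

# For new homes (New = 1) the intercepts and the slopes add up
new_intercept <- b0 + b_new
new_slope <- b_size + b_int
c(intercept = new_intercept, slope = new_slope)
```

The result should agree with the part B fit for new homes (intercept -100755.31, slope 166.35), while the old-home line is simply the `(Intercept)` and `Size` terms.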
# F
```{r}
Size <- 3000
New_Price_withSizeAndNew = ( -22227.81 + 104.44 * Size ) +( - 78527.50 + 61.92 * Size )
Old_Price_withSize = -22227.81 + 104.44 * Size
sprintf("New Price = %f", New_Price_withSizeAndNew)
sprintf("Old Price = %f", Old_Price_withSize)
```
# G
```{r}
Size <- 1500
New_Price_withSizeAndNew = -22227.81 + 104.44 * Size - 78527.50 * 1 + 61.92 * Size * 1
Old_Price_withSize = -22227.81 + 104.44 * Size
sprintf("New Price = %f", New_Price_withSizeAndNew)
sprintf("Old Price = %f", Old_Price_withSize)
```
As the size of the home goes up, the difference in predicted selling price between new and old homes becomes larger: 14,352.50 dollars at 1,500 square feet versus 107,232.50 dollars at 3,000 square feet.
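The gap between new and old homes is itself a linear function of size, given by the `New` and `Size:New` terms. The sketch below uses the rounded coefficients from part E; the function name `price_gap()` is hypothetical.

```{r}
# Predicted new-minus-old price gap from the interaction model (rounded coefficients)
price_gap <- function(size) {
  -78527.50 + 61.92 * size
}

# Gap at the two sizes used in parts F and G
gaps <- price_gap(c(1500, 3000))
gaps

# Size at which the predicted gap is zero
break_even_size <- 78527.50 / 61.92
break_even_size
```

Under these coefficients the gap is zero at roughly 1,268 square feet and grows by 61.92 dollars for each additional square foot.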
# H
When we add the interaction between Size and New, the coefficient for New becomes large and negative (-78,527.50), although it is not statistically significant (p = 0.127). The adjusted R-squared for the interaction model is 0.7363, and the adjusted R-squared for the first model, with Size and New but no interaction, is 0.7169. The increase in adjusted R-squared indicates a slightly better fit even after accounting for the additional term. Although the two models have similar adjusted R-squared values, I prefer the model with the interaction because the interaction term is statistically significant (p = 0.005), so the prediction equation should allow the effect of Size to differ between new and old homes.
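One way to compare the additive model (`Price ~ Size + New`) with the interaction model (`Price ~ Size * New`) is a partial F-test. The sketch below computes it from the printed Multiple R-squared values rather than refitting, so it is only accurate to the rounding of those values; the variable names are hypothetical. With both fitted models stored as objects, `anova()` would perform the same comparison.

```{r}
# Multiple R-squared values copied from the two summary outputs
r2_reduced <- 0.7226   # Price ~ Size + New
r2_full <- 0.7443      # Price ~ Size * New
df_resid_full <- 96    # residual degrees of freedom of the full model

# Partial F statistic for the single added term (Size:New)
f_stat <- ((r2_full - r2_reduced) / 1) / ((1 - r2_full) / df_resid_full)
p_value <- pf(f_stat, df1 = 1, df2 = df_resid_full, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```

Since only one term is added, the F statistic should be close to the square of the interaction term's t value (2.855), and the p-value should be close to the 0.00527 reported for `Size:New`.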