HW4
November 11, 2022
For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = -10,536 + 53.8x1 + 2.84x2.
A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.
-10536+53.8*1240+2.84*18000
[1] 107296
The predicted selling price is $107,296.
The residual is 145,000 - 107,296 = $37,704: the home sold for $37,704 more than the model predicts, so the equation underpredicts this home's price.
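B. For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?
[1] 53.8
The predicted price increases by $53.80 for each additional square foot of home size, because 53.8 is the coefficient of x1 and x2 is held fixed.
C. According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?
Lot size would need to increase by 53.8/2.84 ≈ 18.94 square feet, or about 19 square feet.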
Welch Two Sample t-test
data: Female$salary and Male$salary
t = -1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-7247.1471 567.8539
sample estimates:
mean of x mean of y
21357.14 24696.79
With p = 0.09 > 0.05, we fail to reject the null hypothesis that mean salaries are equal for men and women; the 95% confidence interval for the difference (female minus male), -$7,247 to $568, includes 0.
B. Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.
Call:
lm(formula = salary ~ rank + sex + degree + ysdeg + year, data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
sexFemale 1166.37 925.57 1.260 0.214
degreePhD 1388.61 1018.75 1.363 0.180
ysdeg -124.57 77.49 -1.608 0.115
year 476.31 94.91 5.018 8.65e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
2.5 % 97.5 %
(Intercept) 14134.4059 17357.68946
rankAssoc 2985.4107 7599.31080
rankProf 8396.1546 13841.37340
sexFemale -697.8183 3030.56452
degreePhD -663.2482 3440.47485
ysdeg -280.6397 31.49105
year 285.1433 667.47476
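C. Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables
Holding the other predictors fixed, only rank and year are statistically significant at the 5% level. Relative to Assistant Professors, Associate Professors earn about $5,292 more and full Professors about $11,119 more, and each additional year in current rank adds about $476. Sex (p = 0.214), degree (p = 0.180), and ysdeg (p = 0.115) are not significant: the estimates are $1,166 higher for females, $1,389 higher for PhD holders, and $125 lower per year since the highest degree.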
Change the baseline category for the rank variable. Interpret the coefficients related to rank again.
Call:
lm(formula = salary ~ rank + relevel_sex + degree + ysdeg + year,
data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16912.42 816.44 20.715 < 2e-16 ***
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
relevel_sexMale -1166.37 925.57 -1.260 0.214
degreePhD 1388.61 1018.75 1.363 0.180
ysdeg -124.57 77.49 -1.608 0.115
year 476.31 94.91 5.018 8.65e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
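The chunk relevels sex rather than rank, making Female the baseline. This only flips the sign of the sex coefficient (1,166.37 for sexFemale becomes -1,166.37 for relevel_sexMale) and shifts the intercept from 15,746.05 to 16,912.42; the standard error, p-value (0.214), rank coefficients, and overall fit are unchanged. The rank coefficients therefore still compare Associate and full Professors with the Assistant Professor baseline.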
Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “a variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.
Exclude the variable rank, refit, and summarize how your findings changed, if they did.
Call:
lm(formula = salary ~ sex + degree + ysdeg + year, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 2e-16 ***
sexFemale -1286.54 1313.09 -0.980 0.332209
degreePhD -3299.35 1302.52 -2.533 0.014704 *
ysdeg 339.40 80.62 4.210 0.000114 ***
year 351.97 142.48 2.470 0.017185 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
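With rank excluded, degree (p = 0.015) and ysdeg (p = 0.0001) become statistically significant and both change sign, while year remains significant but less strongly (p = 0.017). The sex coefficient turns negative (-1,286.54) but is still not significant (p = 0.332), and R-squared falls from 0.855 to 0.631.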
Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.
Call:
lm(formula = salary ~ New_Hire + sex + degree, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8260.4 -3557.7 -462.6 3563.2 12098.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28663 1155 24.821 < 2e-16 ***
New_Hire -7418 1306 -5.679 7.74e-07 ***
sexFemale -2716 1433 -1.896 0.064 .
degreePhD -1227 1372 -0.895 0.375
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4558 on 48 degrees of freedom
Multiple R-squared: 0.4416, Adjusted R-squared: 0.4067
F-statistic: 12.65 on 3 and 48 DF, p-value: 3.231e-06
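The New_Hire coefficient is -7,418 (p < 0.001): holding sex and degree fixed, faculty hired by the new Dean earn about $7,418 less on average than those hired earlier. This model therefore does not support the claim that the new Dean's offers are more generous.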
Question 3
A. Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). In particular, for each variable, discuss statistical significance and interpret the meaning of the coefficient.
Call:
lm(formula = Price ~ Size + New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
Size 116.132 8.795 13.204 < 2e-16 ***
New 57736.283 18653.041 3.095 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
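B. Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.
The prediction equation is ŷ = -40,230.867 + 116.132(Size) + 57,736.283(New): price is predicted to rise by $116.13 for every additional square foot, and a new home is predicted to sell for $57,736 more than a not-new home of the same size. For new homes, ŷ = 17,505.416 + 116.132(Size); for not-new homes, ŷ = -40,230.867 + 116.132(Size).
C. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.
D. Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results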
Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 2e-16 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
E. Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.
The interaction term is positive and statistically significant (61.916, p = 0.005), so the slope on size is steeper for new homes. For new homes, ŷ = -100,755.310 + 166.354(Size); for not-new homes, ŷ = -22,227.808 + 104.438(Size).
F. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.
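G. Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.
At 3000 square feet the predictions are $398,306.69 (new) and $291,086.19 (not new), a gap of about $107,220; at 1500 square feet they are $148,775.69 (new) and $134,429.19 (not new), a gap of about $14,346. The premium for a new home grows by $61.92 for each additional square foot, so the difference widens as size increases.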
H. Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?
I prefer the model with the interaction. The interaction term is statistically significant (p = 0.005), adjusted R-squared rises from 0.7169 to 0.7363, and the residual standard error falls from 53,880 to 52,000, so it explains more of the variability in price and allows the effect of size to differ between new and not-new homes.
---
title: "Homework 4"
author: "HW4"
description: "HW4"
date: "11/11/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw1
  - descriptive statistics
  - probability
---
# Question 1
## a
For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is
ŷ = -10,536 + 53.8x1 + 2.84x2.
A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.
```{r, echo=T}
-10536+53.8*1240+2.84*18000
```
The predicted selling price is $107,296.
```{r}
145000-107296
```
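The residual is 145,000 - 107,296 = $37,704: the home sold for $37,704 more than the model predicts, so the equation underpredicts this home's price.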
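As a check on the arithmetic above, the prediction and the residual can be wrapped in a small helper. This is a minimal sketch: the function name `predict_price` and the variable names are illustrative assumptions, not part of the assignment.

```r
# Hypothetical helper: evaluates the prediction equation from the problem statement
predict_price <- function(home_size, lot_size) {
  -10536 + 53.8 * home_size + 2.84 * lot_size
}

observed  <- 145000                      # actual selling price
predicted <- predict_price(1240, 18000)  # predicted selling price
residual  <- observed - predicted        # observed minus predicted

predicted
residual
```

A positive residual means the observed price lies above the fitted equation.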
B. For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?
```{r}
# Predicted price at 1240 and 1241 square feet, with lot size fixed at 18,000
x1240 <- 53.8*(1240) + 2.84*(18000) - 10536
x1241 <- 53.8*(1241) + 2.84*(18000) - 10536
# The difference equals the coefficient of x1
x1241 - x1240
```
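The predicted price increases by $53.80 for each additional square foot of home size, because 53.8 is the coefficient of x1 and x2 is held fixed.
C. According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?
```{r}
53.8/2.84
```
Lot size would need to increase by about 18.94 square feet, or roughly 19 square feet.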
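The lot-size equivalence can be checked numerically: the lot increase that matches one extra square foot of home size is the ratio of the two coefficients. The sketch below uses illustrative variable names that are not in the original.

```r
# Coefficients from the prediction equation
b_home <- 53.8   # dollars per square foot of home size
b_lot  <- 2.84   # dollars per square foot of lot size

# Lot-size increase with the same predicted effect as one square foot of home size
lot_equiv <- b_home / b_lot
lot_equiv

# Check: that lot increase changes the predicted price by b_home dollars
b_lot * lot_equiv
```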
# Question 2
```{r, echo=T}
library(dplyr)
library(alr4)
data("salary")
# Split the data by sex
Female <- salary %>%
  filter(sex == "Female")
Male <- salary %>%
  filter(sex == "Male")
# Welch two-sample t-test of mean salary, female vs. male
t.test(Female$salary, Male$salary)
```
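With p = 0.09 > 0.05, we fail to reject the null hypothesis that mean salaries are equal for men and women; the 95% confidence interval for the difference (female minus male), -$7,247 to $568, includes 0.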
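The reported p-value can be reproduced from the printed test statistic and the Welch degrees of freedom. This is a sketch that copies the rounded values from the output, so the result agrees only up to rounding; the variable names are illustrative.

```r
# Values copied from the Welch t-test output (rounded as printed)
t_stat <- -1.7744
df     <- 21.591

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)
p_value

# Decision at the 5% level: TRUE would mean "reject the null"
p_value < 0.05
```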
B. Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.
```{r}
# Full model: sex adjusted for rank, degree, years since degree, and years in rank
Fit <- lm(salary ~ rank + sex + degree + ysdeg + year, data = salary)
summary(Fit)
# 95% confidence intervals; the sexFemale row is the female minus male difference
confint(Fit, level = .95)
```
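The `sexFemale` row of the `confint()` output is the requested interval for the adjusted female minus male difference. As a sketch of where it comes from, the interval can be rebuilt from the printed estimate, standard error, and residual degrees of freedom; the variable names below are illustrative.

```r
# Values copied from the summary(Fit) output
estimate <- 1166.37   # sexFemale coefficient
std_err  <- 925.57    # its standard error
df_resid <- 45        # residual degrees of freedom

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_err
ci
```

Because the interval contains 0, the adjusted difference in salary between the sexes is not statistically significant at the 5% level.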
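C. Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables
Holding the other predictors fixed, only rank and year are statistically significant at the 5% level. Relative to Assistant Professors, Associate Professors earn about $5,292 more and full Professors about $11,119 more, and each additional year in current rank adds about $476. Sex (p = 0.214), degree (p = 0.180), and ysdeg (p = 0.115) are not significant: the estimates are $1,166 higher for females, $1,389 higher for PhD holders, and $125 lower per year since the highest degree.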
Change the baseline category for the rank variable. Interpret the coefficients related to rank again.
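```{r}
# Note: this relevels sex (baseline becomes Female), not rank
relevel_sex <- relevel(salary$sex, ref = "Female")
New_Fit <- lm(salary ~ rank + relevel_sex + degree + ysdeg + year, data = salary)
summary(New_Fit)
```
The chunk relevels sex rather than rank, making Female the baseline. This only flips the sign of the sex coefficient (1,166.37 for sexFemale becomes -1,166.37 for relevel_sexMale) and shifts the intercept from 15,746.05 to 16,912.42; the standard error, p-value (0.214), rank coefficients, and overall fit are unchanged. The rank coefficients therefore still compare Associate and full Professors with the Assistant Professor baseline.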
Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, "[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be 'tainted.'" Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.
Exclude the variable rank, refit, and summarize how your findings changed, if they did.
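```{r}
# Refit without rank, which may itself reflect discrimination
Fit_No_Rank <- lm(salary ~ sex + degree + ysdeg + year, data = salary)
summary(Fit_No_Rank)
```
With rank excluded, degree (p = 0.015) and ysdeg (p = 0.0001) become statistically significant and both change sign, while year remains significant but less strongly (p = 0.017). The sex coefficient turns negative (-1,286.54) but is still not significant (p = 0.332), and R-squared falls from 0.855 to 0.631.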
Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.
```{r}
salary$New_Hire <- ifelse(salary$ysdeg <= 15, 1, 0)
New_Dean <- lm(salary ~ New_Hire + sex + degree , data = salary)
summary(New_Dean)
```
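The New_Hire coefficient is -7,418 (p < 0.001): holding sex and degree fixed, faculty hired by the new Dean earn about $7,418 less on average than those hired earlier. This model therefore does not support the claim that the new Dean's offers are more generous.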
# Question 3
A. Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). In particular, for each variable, discuss statistical significance and interpret the meaning of the coefficient.
```{r}
library("smss")
data("house.selling.price")
Price_Model <- lm(Price ~ Size + New, data = house.selling.price)
summary(Price_Model)
```
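B. Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.
The prediction equation is ŷ = -40,230.867 + 116.132(Size) + 57,736.283(New): price is predicted to rise by $116.13 for every additional square foot, and a new home is predicted to sell for $57,736 more than a not-new home of the same size. For new homes, ŷ = 17,505.416 + 116.132(Size); for not-new homes, ŷ = -40,230.867 + 116.132(Size).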
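The two separate equations follow from setting New to 1 or 0 in the fitted model. A minimal sketch, with coefficients copied from the regression output and illustrative variable names:

```r
# Coefficients copied from the summary(Price_Model) output
b0     <- -40230.867  # intercept
b_size <- 116.132     # slope on Size
b_new  <- 57736.283   # shift for New = 1

# With no interaction, both lines share one slope and differ only in intercept
intercept_not_new <- b0
intercept_new     <- b0 + b_new

c(new = intercept_new, not_new = intercept_not_new, slope = b_size)
```

The two lines are parallel, so the predicted gap between new and not-new homes is the same at every size in this model.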
C. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.
```{r}
# New home: intercept is -40230.867 + 57736.283 = 17505.416
116.132*(3000) + 17505.416
```
```{r}
# Not-new home
116.132*(3000) - 40230.867
```
D. Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results
```{r}
New_Price_Model <- lm(Price ~ Size + New + Size*New, data = house.selling.price)
summary(New_Price_Model)
```
E. Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.
The interaction term is positive and statistically significant (61.916, p = 0.005), so the slope on size is steeper for new homes. For new homes, ŷ = -100,755.310 + 166.354(Size); for not-new homes, ŷ = -22,227.808 + 104.438(Size).
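With an interaction, setting New to 1 shifts both the intercept and the slope. The sketch below derives the two lines from the printed coefficients; the variable names are illustrative and not part of the original analysis.

```r
# Coefficients copied from the summary(New_Price_Model) output
b0      <- -22227.808  # intercept
b_size  <- 104.438     # slope on Size for not-new homes
b_new   <- -78527.502  # intercept shift for new homes
b_inter <- 61.916      # extra slope on Size for new homes

# Not-new homes (New = 0)
line_not_new <- c(intercept = b0, slope = b_size)
# New homes (New = 1): both intercept and slope shift
line_new <- c(intercept = b0 + b_new, slope = b_size + b_inter)

line_not_new
line_new
```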
F. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.
```{r}
# New home: intercept -22227.808 - 78527.502 = -100755.31, slope 104.438 + 61.916
104.438*(3000) + 61.916*(3000) - 100755.31
```
```{r}
# Not-new home
104.438*(3000) - 22227.808
```
G. Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.
```{r}
# New Home
104.438*(1500) + 61.916*(1500) - 100755.31
```
```{r}
# Old Home
104.438*(1500) - 22227.808
```
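At 3000 square feet the predictions are $398,306.69 (new) and $291,086.19 (not new), a gap of about $107,220; at 1500 square feet they are $148,775.69 (new) and $134,429.19 (not new), a gap of about $14,346. The premium for a new home grows by $61.92 for each additional square foot, so the difference widens as size increases.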
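The way the new-home premium changes with size can be written as a function of size alone, since the shared intercept and base slope cancel. This is a sketch using the interaction-model coefficients; `new_premium` is a hypothetical helper name.

```r
# Interaction-model coefficients copied from the regression output
b_new   <- -78527.502  # intercept shift for new homes
b_inter <- 61.916      # extra slope on Size for new homes

# Hypothetical helper: predicted new minus not-new price at a given size
new_premium <- function(size) b_new + b_inter * size

new_premium(c(1500, 3000))

# Size at which the two fitted lines cross (premium equals zero)
-b_new / b_inter
```

The crossing point is roughly 1,268 square feet; below that size the fitted model predicts a lower price for a new home, which is an extrapolation to treat with caution.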
H. Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?
I prefer the model with the interaction. The interaction term is statistically significant (p = 0.005), adjusted R-squared rises from 0.7169 to 0.7363, and the residual standard error falls from 53,880 to 52,000, so it explains more of the variability in price and allows the effect of size to differ between new and not-new homes.
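One way to formalize the comparison is a partial F-test of the two nested models. With both fits in memory, `anova(Price_Model, New_Price_Model)` performs it directly; the sketch below instead approximates the statistic from the printed (rounded) residual standard errors, so it matches the exact value only roughly. Variable names are illustrative.

```r
# Residual standard errors and degrees of freedom copied from the two summaries
rse_add <- 53880; df_add <- 97   # model without interaction
rse_int <- 52000; df_int <- 96   # model with interaction

# Residual sums of squares recovered from the rounded residual standard errors
rss_add <- rse_add^2 * df_add
rss_int <- rse_int^2 * df_int

# Partial F statistic for the single added interaction term
f_stat <- ((rss_add - rss_int) / (df_add - df_int)) / (rss_int / df_int)
f_stat

# For one added term this should be close to the squared t value of Size:New
2.855^2
```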