library(tidyverse)
library(ggplot2)
library(stats)
library(alr4)
library(smss)
knitr::opts_chunk$set(echo = TRUE)
Niharika Pola
November 27, 2022
[1] 107296
The residual is the observed price minus the predicted price, 145000 - 107296 = 37704, so the house sold for 37704 dollars more than predicted.
Using the prediction equation ŷ = -10536 + 53.8x1 + 2.84x2, where x1 is home size and x2 is lot size, the predicted selling price increases by 53.8 dollars for each one-square-foot increase in home size when lot size is held fixed. With lot size fixed, the term 2.84x2 is a constant, so any change in the prediction comes only from the home-size term. Increasing x1 by 1 square foot therefore raises the prediction by 53.8 * 1 = 53.8 dollars.
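For fixed home size, we need the increase in lot size x2 whose effect on the prediction equals that of one extra square foot of home size:
53.8 * 1 = 2.84x2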
An increase in lot size of about 18.94 square-feet would have the same impact as an increase of 1 square-foot in home size on the predicted selling price.
Call:
lm(formula = salary ~ sex, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8602.8 -4296.6 -100.8 3513.1 16687.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24697 938 26.330 <2e-16 ***
sexFemale -3340 1808 -1.847 0.0706 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared: 0.0639, Adjusted R-squared: 0.04518
F-statistic: 3.413 on 1 and 50 DF, p-value: 0.0706
The null hypothesis is that mean salary is the same for men and women; the alternative hypothesis is that the mean salaries differ. I ran a regression with sex as the explanatory variable and salary as the outcome variable. The coefficient for sexFemale is -3340, which means that women in the sample earn 3340 dollars less than men on average, without accounting for any other variables. However, the p-value is 0.0706, which is above the 0.05 significance level, so we fail to reject the null hypothesis and cannot conclude that mean salaries differ between men and women.
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
2.5 % 97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD -663.2482 3440.47485
rankAssoc 2985.4107 7599.31080
rankProf 8396.1546 13841.37340
sexFemale -697.8183 3030.56452
year 285.1433 667.47476
ysdeg -280.6397 31.49105
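Controlling for the other predictors and assuming no interaction between sex and those predictors, we can be 95% confident that the difference in mean salary for women compared to men falls between -697.82 dollars and 3030.56 dollars. Because the interval contains 0, the difference is not statistically significant at the 5% level.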
For degree, a PhD is expected to increase salary by 1388.61 dollars relative to a Master's degree, holding the other variables fixed. However, with a p-value of 0.18, we cannot conclude that degree level has a statistically significant effect on salary.
For rank, an Associate Professor can expect a salary 5292.36 dollars higher than an Assistant Professor, and a full Professor can expect a salary 11118.76 dollars higher than an Assistant Professor. Both coefficients have p-values well below 0.05, so rank has a statistically significant effect on salary.
For sex, the model predicts that a woman's salary is 1166.37 dollars higher than a man's, but the p-value is 0.214, so this relationship is not statistically significant.
For year, a faculty member can expect a salary increase of 476.31 dollars for each additional year of employment in their current position. The p-value is well below 0.01, so the relationship between year and salary is statistically significant.
For ysdeg, each additional year since earning the highest degree is associated with a 124.57 dollar decrease in salary. However, with a p-value of 0.115, this relationship is not statistically significant.
Call:
lm(formula = salary ~ rank, data = salary)
Residuals:
Min 1Q Median 3Q Max
-5209.0 -1819.2 -417.8 1586.6 8386.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29659.0 669.3 44.316 < 2e-16 ***
rankAsst -11890.3 972.4 -12.228 < 2e-16 ***
rankAssoc -6483.0 1043.0 -6.216 1.09e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2993 on 49 degrees of freedom
Multiple R-squared: 0.7542, Adjusted R-squared: 0.7442
F-statistic: 75.17 on 2 and 49 DF, p-value: 1.174e-15
After changing the baseline category for rank to Professor, in a model with rank as the only predictor, an Associate Professor can expect a salary 6483.0 dollars lower than a Professor, and an Assistant Professor can expect a salary 11890.3 dollars lower than a Professor. Both coefficients have p-values well below 0.05, so rank has a statistically significant effect on salary.
Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 2e-16 ***
degreePhD -3299.35 1302.52 -2.533 0.014704 *
sexFemale -1286.54 1313.09 -0.980 0.332209
year 351.97 142.48 2.470 0.017185 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
When the variable “rank” is removed, the coefficient for sex is -1286.54, compared with 1166.37 in the regression above that included rank. The new coefficient predicts that a woman's salary would be 1286.54 dollars lower than a man's when rank is excluded. However, the p-value is 0.332, well above 0.05, so the result is not statistically significant. While the change in sign upon removal of rank is interesting, the lack of statistical significance would likely prevent these results from holding up in court as evidence of discrimination on the basis of sex.
Call:
lm(formula = salary ~ hired, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8294 -3486 -1772 3829 10576
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27469.4 913.4 30.073 < 2e-16 ***
hired1 -7343.5 1291.8 -5.685 6.73e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4658 on 50 degrees of freedom
Multiple R-squared: 0.3926, Adjusted R-squared: 0.3804
F-statistic: 32.32 on 1 and 50 DF, p-value: 6.734e-07
Call:
lm(formula = salary ~ sex + rank + degree + hired, data = salary)
Residuals:
Min 1Q Median 3Q Max
-6187.5 -1750.9 -438.9 1719.5 9362.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29511.3 784.0 37.640 < 2e-16 ***
sexFemale -829.2 997.6 -0.831 0.410
rankAsst -11925.7 1512.4 -7.885 4.37e-10 ***
rankAssoc -7100.4 1297.0 -5.474 1.76e-06 ***
degreePhD 1126.2 1018.4 1.106 0.275
hired1 319.0 1303.8 0.245 0.808
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3023 on 46 degrees of freedom
Multiple R-squared: 0.7645, Adjusted R-squared: 0.7389
F-statistic: 29.87 on 5 and 46 DF, p-value: 2.192e-13
I created a dummy variable called “hired”, coded 1 for faculty whose highest degree was earned 15 years ago or less (ysdeg <= 15, and thus taken to be hired by the new Dean) and 0 for those whose degree was earned more than 15 years ago. Then I fit a new regression model with sex, rank, degree, and hired as predictors. I omitted year and ysdeg to avoid multicollinearity, which is a concern when predictors are highly correlated. The idea of regression is to observe how each variable affects the outcome while holding the other variables fixed. We cannot reasonably change year, ysdeg, or hired individually while holding the other two fixed, since they tend to “grow” together. In particular, hired is derived directly from ysdeg, so the two should not be included in the same model.
Based on the regression model, those hired by the current Dean are predicted to make 319 dollars more than those not hired by the Dean, holding sex, rank, and degree fixed. Relative to the salaries in the data, this is a small amount. Furthermore, the p-value for the hired variable is 0.808, far above 0.05, which indicates that the relationship between hired and salary is not statistically significant. Based on these factors, I would state that the findings do not indicate any favorable treatment by the Dean toward faculty whom the Dean specifically hired.
Call:
lm(formula = Price ~ Size + New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
Size 116.132 8.795 13.204 < 2e-16 ***
New 57736.283 18653.041 3.095 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
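Both Size and New are statistically significant, positive predictors of selling price. Holding New fixed, each one-unit increase in Size raises the predicted selling price by 116.132 dollars; holding Size fixed, a new home is predicted to sell for 57736.283 dollars more than an old home.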
Call:
lm(formula = Price ~ Size, data = new)
Residuals:
Min 1Q Median 3Q Max
-78606 -16092 -987 20068 76140
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -100755.31 42513.73 -2.370 0.0419 *
Size 166.35 17.09 9.735 4.47e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 45500 on 9 degrees of freedom
Multiple R-squared: 0.9133, Adjusted R-squared: 0.9036
F-statistic: 94.76 on 1 and 9 DF, p-value: 4.474e-06
Call:
lm(formula = Price ~ Size, data = old)
Residuals:
Min 1Q Median 3Q Max
-175748 -29155 -7297 14159 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15708.186 -1.415 0.161
Size 104.438 9.538 10.950 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52620 on 87 degrees of freedom
Multiple R-squared: 0.5795, Adjusted R-squared: 0.5747
F-statistic: 119.9 on 1 and 87 DF, p-value: < 2.2e-16
Size is a statistically significant, positive predictor of price for both new and old houses, but the effect is larger for new houses. Adjusted R-squared is also much higher for the new-home model (0.90 vs. 0.57).
New_Price = 166.35 * Size - 100755.31
Old_Price = 104.438 * Size - 22227.808
Call:
lm(formula = Price ~ Size * New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 2e-16 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
The predicted selling price, based on the new regression that includes interaction between Size and Newness, would look like:
New_Price = -22227.81 + 104.44 * Size - 78527.50 * 1 + 61.92 * Size * 1
Old_Price = -22227.81 + 104.44 * Size
[1] 148784.7
[1] 134432.2
As the size of the home goes up, the difference in predicted selling prices between new and old homes becomes larger: 14352.5 dollars at a Size of 1500, compared with 107232.5 dollars at a Size of 3000.
The model with interaction has a large negative coefficient for the New variable (-78527.50), although that coefficient is not statistically significant on its own (p = 0.127). The adjusted R-squared is 0.7363 for the model with interaction and 0.7169 for the first model without interaction. Because adjusted R-squared penalizes additional predictors, this increase suggests a slightly better fit rather than an artifact of adding a variable. Although the adjusted R-squared values are similar, I prefer the model with interaction because the interaction term is statistically significant (p = 0.005), so the prediction equation should account for it.
---
title: "Homework 4"
author: "Niharika Pola"
description: "homework-4"
date: "11/27/2022"
format:
  html:
    df-print: paged
    css: styles.css
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw4
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(ggplot2)
library(stats)
library(alr4)
library(smss)
knitr::opts_chunk$set(echo = TRUE)
```
## Question 1
## A
```{r}
Predicted_selling_price <- -10536 + 53.8 * 1240 + 2.84 * 18000
Predicted_selling_price
```
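```{r}
# Residual = observed selling price - predicted selling price
Residual <- 145000 - Predicted_selling_price
Residual
```
The residual is the observed price minus the predicted price, 145000 - 107296 = 37704, so the house sold for 37704 dollars more than predicted.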
## B
Using the prediction equation ŷ = -10536 + 53.8x1 + 2.84x2, where x1 is home size and x2 is lot size, the predicted selling price increases by 53.8 dollars for each one-square-foot increase in home size when lot size is held fixed. With lot size fixed, the term 2.84x2 is a constant, so any change in the prediction comes only from the home-size term. Increasing x1 by 1 square foot therefore raises the prediction by 53.8 \* 1 = 53.8 dollars.
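As a quick check of this interpretation, the sketch below wraps the prediction equation in a small helper. The function name `predict_price` and the choice of 1240 and 18000 as example inputs are assumptions made for illustration only.

```{r}
# Hypothetical helper: the prediction equation from Question 1
predict_price <- function(home_size, lot_size) {
  -10536 + 53.8 * home_size + 2.84 * lot_size
}

# Hold lot size fixed at 18000 and add one square foot of home size
base_price <- predict_price(1240, 18000)
plus_one   <- predict_price(1241, 18000)

# The difference equals the home-size coefficient, 53.8
plus_one - base_price
```

Any fixed lot size would give the same difference, because the lot-size term cancels when the two predictions are subtracted.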
## C
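For fixed home size, we need the increase in lot size x2 whose effect on the prediction equals that of one extra square foot of home size:
53.8 \* 1 = 2.84x2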
```{r}
x2 <- 53.8/2.84
x2
```
An increase in lot size of about 18.94 square-feet would have the same impact as an increase of 1 square-foot in home size on the predicted selling price.
## Question 2
```{r}
data("salary")
salary
```
## A
```{r}
summary(lm(salary ~ sex, data = salary))
```
The null hypothesis is that mean salary is the same for men and women; the alternative hypothesis is that the mean salaries differ. I ran a regression with sex as the explanatory variable and salary as the outcome variable. The coefficient for sexFemale is -3340, which means that women in the sample earn 3340 dollars less than men on average, without accounting for any other variables. However, the p-value is 0.0706, which is above the 0.05 significance level, so we fail to reject the null hypothesis and cannot conclude that mean salaries differ between men and women.
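The decision above can be reproduced by hand from the reported estimate and standard error. This is only a sketch: it assumes a two-sided t test with the 50 residual degrees of freedom shown in the summary, and the variable names are my own.

```{r}
# Values copied from the summary output for salary ~ sex
estimate  <- -3340
std_error <- 1808
df_resid  <- 50
alpha     <- 0.05

# t statistic and two-sided p-value
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

t_value
p_value
p_value < alpha  # FALSE here, so we fail to reject the null hypothesis
```

Because the inputs are rounded, the result may differ slightly in the last digits from the values printed by `summary()`.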
## B
```{r}
model <- lm(salary ~ ., data = salary)
summary(model)
```
```{r}
confint(model)
```
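Controlling for the other predictors and assuming no interaction between sex and those predictors, we can be 95% confident that the difference in mean salary for women compared to men falls between -697.82 dollars and 3030.56 dollars. Because the interval contains 0, the difference is not statistically significant at the 5% level.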
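For reference, the interval for sexFemale can be rebuilt from the coefficient table as estimate plus or minus the t critical value times the standard error. The sketch below assumes the 45 residual degrees of freedom reported in the summary; the variable names are illustrative.

```{r}
# Values copied from the summary output for salary ~ .
estimate  <- 1166.37
std_error <- 925.57
df_resid  <- 45

# Critical value for a 95% two-sided interval
t_crit <- qt(0.975, df = df_resid)

ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
ci
```

Up to rounding of the inputs, this should agree with the sexFemale row returned by `confint(model)`.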
## C
For degree, a PhD is expected to increase salary by 1388.61 dollars relative to a Master's degree, holding the other variables fixed. However, with a p-value of 0.18, we cannot conclude that degree level has a statistically significant effect on salary.
For rank, an Associate Professor can expect a salary 5292.36 dollars higher than an Assistant Professor, and a full Professor can expect a salary 11118.76 dollars higher than an Assistant Professor. Both coefficients have p-values well below 0.05, so rank has a statistically significant effect on salary.
For sex, the model predicts that a woman's salary is 1166.37 dollars higher than a man's, but the p-value is 0.214, so this relationship is not statistically significant.
For year, a faculty member can expect a salary increase of 476.31 dollars for each additional year of employment in their current position. The p-value is well below 0.01, so the relationship between year and salary is statistically significant.
For ysdeg, each additional year since earning the highest degree is associated with a 124.57 dollar decrease in salary. However, with a p-value of 0.115, this relationship is not statistically significant.
## D
```{r}
salary$rank <- relevel(salary$rank, ref = "Prof")
summary(lm(salary ~ rank, salary))
```
After changing the baseline category for rank to Professor, in a model with rank as the only predictor, an Associate Professor can expect a salary 6483.0 dollars lower than a Professor, and an Assistant Professor can expect a salary 11890.3 dollars lower than a Professor. Both coefficients have p-values well below 0.05, so rank has a statistically significant effect on salary.
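Releveling only changes which group the coefficients are measured against; the fitted values stay the same. The sketch below shows this on a small made-up data frame (`toy`, with hypothetical pay values), not on the salary data.

```{r}
# Hypothetical toy data: three ranks with group means 11, 16, and 21
toy <- data.frame(
  rank = factor(rep(c("Asst", "Assoc", "Prof"), each = 3),
                levels = c("Asst", "Assoc", "Prof")),
  pay  = c(10, 11, 12, 15, 16, 17, 20, 21, 22)
)

# Baseline = Asst: coefficients are differences from Asst
fit_asst <- lm(pay ~ rank, data = toy)

# Baseline = Prof: coefficients are differences from Prof
toy$rank_prof <- relevel(toy$rank, ref = "Prof")
fit_prof <- lm(pay ~ rank_prof, data = toy)

coef(fit_asst)
coef(fit_prof)

# Same predictions under either baseline
all.equal(unname(fitted(fit_asst)), unname(fitted(fit_prof)))
```

The signs of the rank coefficients flip because the reference group moves from the lowest-paid rank to the highest-paid one.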
## E
```{r}
summary(lm(salary ~ degree + sex + year + ysdeg, salary))
```
When the variable "rank" is removed, the coefficient for sex is -1286.54, compared with 1166.37 in the regression above that included rank. The new coefficient predicts that a woman's salary would be 1286.54 dollars lower than a man's when rank is excluded. However, the p-value is 0.332, well above 0.05, so the result is not statistically significant. While the change in sign upon removal of rank is interesting, the lack of statistical significance would likely prevent these results from holding up in court as evidence of discrimination on the basis of sex.
## F
```{r}
# Dummy variable: "1" if ysdeg <= 15 (taken as hired by the new Dean), "0" otherwise
salary <- salary %>%
  mutate(hired = case_when(ysdeg <= 15 ~ "1", ysdeg > 15 ~ "0"))
summary(lm(salary ~ hired, data = salary))
```
```{r}
summary(lm(salary ~ sex + rank + degree + hired, data = salary))
```
I created a dummy variable called "hired", coded 1 for faculty whose highest degree was earned 15 years ago or less (ysdeg <= 15, and thus taken to be hired by the new Dean) and 0 for those whose degree was earned more than 15 years ago. Then I fit a new regression model with sex, rank, degree, and hired as predictors. I omitted year and ysdeg to avoid multicollinearity, which is a concern when predictors are highly correlated. The idea of regression is to observe how each variable affects the outcome while holding the other variables fixed. We cannot reasonably change year, ysdeg, or hired individually while holding the other two fixed, since they tend to "grow" together. In particular, hired is derived directly from ysdeg, so the two should not be included in the same model.
Based on the regression model, those hired by the current Dean are predicted to make 319 dollars more than those not hired by the Dean, holding sex, rank, and degree fixed. Relative to the salaries in the data, this is a small amount. Furthermore, the p-value for the hired variable is 0.808, far above 0.05, which indicates that the relationship between hired and salary is not statistically significant. Based on these factors, I would state that the findings do not indicate any favorable treatment by the Dean toward faculty whom the Dean specifically hired.
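To illustrate why a dummy built from ysdeg overlaps with ysdeg itself, the sketch below uses made-up values (`toy_ysdeg`, a hypothetical sequence from 1 to 30) rather than the salary data, with the same cutoff of 15.

```{r}
# Hypothetical values, not the salary data: ysdeg running from 1 to 30
toy_ysdeg <- 1:30

# Same cutoff as the hired dummy
toy_hired <- as.integer(toy_ysdeg <= 15)

# A dummy created by thresholding a variable is strongly correlated with it
cor(toy_ysdeg, toy_hired)
```

A correlation this strong is the kind of overlap that makes it hard to separate the two effects when both variables are in the same model.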
## Question 3
```{r}
data("house.selling.price")
house.selling.price
```
## A
```{r}
summary(lm(Price ~ Size + New, house.selling.price))
```
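Both Size and New are statistically significant, positive predictors of selling price. Holding New fixed, each one-unit increase in Size raises the predicted selling price by 116.132 dollars; holding Size fixed, a new home is predicted to sell for 57736.283 dollars more than an old home.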
## B
```{r}
new <- house.selling.price %>%
  filter(New == 1)
summary(lm(Price ~ Size, data = new))
```
```{r}
old <- house.selling.price %>%
  filter(New == 0)
summary(lm(Price ~ Size, data = old))
```
Size is a statistically significant, positive predictor of price for both new and old houses, but the effect is larger for new houses. Adjusted R-squared is also much higher for the new-home model (0.90 vs. 0.57).
New_Price = 166.35 \* Size - 100755.31
Old_Price = 104.438 \* Size - 22227.808
## C
```{r}
Size <- 3000
# Use the fitted slopes without extra rounding (166.35 and 104.438)
New_Price = 166.35 * Size - 100755.31
Old_Price = 104.438 * Size - 22227.808
New_Price
Old_Price
```
## D
```{r}
summary(lm(Price ~ Size*New, data = house.selling.price))
```
## E
The predicted selling price, based on the new regression that includes interaction between Size and Newness, would look like:
New_Price = -22227.81 + 104.44 \* Size - 78527.50 \* 1 + 61.92 \* Size \* 1
Old_Price = -22227.81 + 104.44 \* Size
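Collecting terms in the equation for new homes shows what the interaction model implies. The sketch below does this arithmetic with the coefficients from the summary output; the variable names are my own.

```{r}
# Coefficients copied from the summary output for Price ~ Size * New
b_intercept <- -22227.808
b_size      <- 104.438
b_new       <- -78527.502
b_size_new  <- 61.916

# Setting New = 1 and collecting terms gives the line for new homes
new_intercept <- b_intercept + b_new
new_slope     <- b_size + b_size_new

c(intercept = new_intercept, slope = new_slope)
```

Up to rounding, these match the separate regression for new homes (intercept -100755.31, slope 166.35), and the line for old homes is unchanged, so the interaction model fits a separate line for each group.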
## F
```{r}
Size <- 3000
New_Price = -22227.81 + 104.44 * Size - 78527.50 * 1 + 61.92 * Size * 1
Old_Price = -22227.81 + 104.44 * Size
New_Price
Old_Price
```
## G
```{r}
Size <- 1500
New_Price = -22227.81 + 104.44 * Size - 78527.50 * 1 + 61.92 * Size * 1
Old_Price = -22227.81 + 104.44 * Size
New_Price
Old_Price
```
As the size of the home goes up, the difference in predicted selling prices between new and old homes becomes larger: 14352.5 dollars at a Size of 1500, compared with 107232.5 dollars at a Size of 3000.
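The gap between the two predictions can be written directly as a function of Size by subtracting the old-home equation from the new-home equation. The helper name `price_gap` below is hypothetical, and the coefficients are the rounded ones used in parts F and G.

```{r}
# Hypothetical helper: new-minus-old predicted price from the interaction model
price_gap <- function(size) {
  -78527.50 + 61.92 * size
}

price_gap(1500)
price_gap(3000)

# Size at which the predicted prices of new and old homes are equal
break_even <- 78527.50 / 61.92
break_even
```

Below the break-even size the model predicts a lower price for a new home than for an old one, which is an extrapolation that should be treated with caution.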
## H
The model with interaction has a large negative coefficient for the New variable (-78527.50), although that coefficient is not statistically significant on its own (p = 0.127). The adjusted R-squared is 0.7363 for the model with interaction and 0.7169 for the first model without interaction. Because adjusted R-squared penalizes additional predictors, this increase suggests a slightly better fit rather than an artifact of adding a variable. Although the adjusted R-squared values are similar, I prefer the model with interaction because the interaction term is statistically significant (p = 0.005), so the prediction equation should account for it.
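As a check on the comparison, the adjusted R-squared values can be recomputed from the reported R-squared values. This is a sketch using the standard formula; the helper name `adj_r2` is hypothetical, and n = 100 is inferred from the residual degrees of freedom in the summaries.

```{r}
# Hypothetical helper: adjusted R-squared from R-squared, sample size n,
# and number of predictors p (not counting the intercept)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# R-squared values copied from the two summaries
adj_main        <- adj_r2(0.7226, n = 100, p = 2)  # Price ~ Size + New
adj_interaction <- adj_r2(0.7443, n = 100, p = 3)  # Price ~ Size * New

adj_main
adj_interaction
```

The interaction model keeps a higher adjusted R-squared even after the penalty for its extra term.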