Homework 5
Author

Thrishul

Published

May 5, 2023

Question 1

A

Code
library(alr4)
Warning: package 'alr4' was built under R version 4.2.3
Loading required package: car
Warning: package 'car' was built under R version 4.2.3
Loading required package: carData
Warning: package 'carData' was built under R version 4.2.3
Loading required package: effects
Warning: package 'effects' was built under R version 4.2.3
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
Warning: package 'smss' was built under R version 4.2.3
Code
x1 <- 1240
x2 <- 18000
actual_price <- 145000

predicted_price <- -10536 + 53.8 * x1 + 2.84 * x2
residual <- actual_price - predicted_price
cat("A. Predicted selling price:", predicted_price, "\nResidual:", residual, "\n")
A. Predicted selling price: 107296 
Residual: 37704 

Based on the given variables, where y represents the selling price, and x1 and x2 represent the house size and lot size respectively, and using the prediction equation for a 1,240 sqft house on an 18,000 sqft lot, the estimated selling price is $107,296. However, the actual selling price of the house is $145,000, indicating that it sold for $37,704 more than the projected selling price.

B

Code
home_size_coefficient <- 53.8
cat("B. House selling price predicted to increase for each square-foot increase in home size:", home_size_coefficient, "\n")
B. House selling price predicted to increase for each square-foot increase in home size: 53.8 

C

Code
home_size_increase <- 1
lot_size_coefficient <- 2.84

lot_size_increase <- (home_size_coefficient * home_size_increase) / lot_size_coefficient
cat("C. Lot size increase needed to have the same impact as a one-square-foot increase in home size:", lot_size_increase, "\n")
C. Lot size increase needed to have the same impact as a one-square-foot increase in home size: 18.94366 

Question 2

A

Code
library(alr4)
data(salary)

t_test <- t.test(salary ~ sex, data = salary)
t_test

    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14 

Irrespective of educational qualifications or employment hierarchy, the mean salary for men is $24,696.79, while the average income for women is $21,357.14, reflecting a difference of $3,339.65 between the two genders.

B

Code
model <- lm(salary ~ ., data = salary)
conf_int <- confint(model, level = 0.95)
conf_int
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
sexFemale    -697.8183  3030.56452
year          285.1433   667.47476
ysdeg        -280.6397    31.49105

Based on a multiple linear regression analysis with a 95% confidence level, the estimated pay gap between men and women ranges from -$697.82 to $3,030.57

C

Code
summary(model)

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

In summary, there are several factors that impact wage increases, including a person’s educational qualifications and job rank. Specifically, individuals with a PhD experience an average wage increase of $1,388.61, while those in the rank of Associate Professor receive an increase of $5,292.36, and full/tenured Professors experience the highest increase of $11,118.75. Gender also plays a role, with females receiving a wage increase of $1,166.37. Additionally, for each year an individual remains at their current rank, they receive an increase of $476.31, except for individuals who have reached their maximum degree/rank level, as their salary decreases by $124.57 for each year after that. Notably, all slopes, except for the one related to maximum degree/rank level, are positive. The individual’s rank and the number of years spent at their current rank are both statistically significant, with p-values less than 0.05.

D

Code
# Change the baseline category for rank and rerun the model
salary$rank <- relevel(salary$rank, ref = "Asst")
model2 <- lm(salary ~ ., data = salary)
summary(model2)

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

There have been no changes to the values previously stated. Therefore, the wage increases and other factors mentioned in the previous statement remain the same.

E

Code
# Exclude the rank variable and rerun the model
model3 <- lm(salary ~ degree + sex + year + ysdeg, data = salary)
summary(model3)

Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

When the ‘rank’ variable is removed from the equation, the salary decrease is $3,299.35 for individuals with a PhD and $1,286.54 for females. On the other hand, salary gains occur at a rate of $351.97 for each year spent at the current level and $339.40 for each year after obtaining the highest degree. It is worth noting that half of the slopes are positive, while the other half are negative. Specifically, ‘degreePhD’, ‘year’, and ‘ysdeg’ are all statistically significant with p-values less than 0.05, while ‘sexFemale’ is not.

F

Code
# Create a new variable and run the multiple regression model
salary$new_dean <- ifelse(salary$ysdeg <= 15, "New", "Old")
model4 <- lm(salary ~ degree + sex + new_dean + year*ysdeg, data = salary)
summary(model4)

Call:
lm(formula = salary ~ degree + sex + new_dean + year * ysdeg, 
    data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8396.1 -2171.9  -352.5  2053.3 11061.3 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16855.764   1508.902  11.171 1.45e-14 ***
degreePhD   -3205.140   1431.791  -2.239   0.0302 *  
sexFemale   -1222.794   1345.288  -0.909   0.3682    
new_deanOld   550.409   2119.912   0.260   0.7963    
year          462.292    277.040   1.669   0.1021    
ysdeg         332.120    141.753   2.343   0.0236 *  
year:ysdeg     -4.525     10.083  -0.449   0.6557    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3814 on 45 degrees of freedom
Multiple R-squared:  0.6334,    Adjusted R-squared:  0.5845 
F-statistic: 12.96 on 6 and 45 DF,  p-value: 1.902e-08

It is highly likely that there is a strong correlation between ‘year’ and ‘ysdeg’.

QUESTION 3

A

Code
library(smss)
data(house.selling.price)

model <- lm(Price ~ Size + New, data = house.selling.price)
summary(model)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

When controlling for size, the predictor variables ‘New’ and ‘Size’ have p-values of 0.00257 and 2e-16, respectively. Both of these p-values are statistically significant and less than the alpha level of 0.05. This indicates that the null hypothesis can be rejected, and there is likely a relationship between ‘New’ and ‘Price’, as well as between ‘Size’ and ‘Price’ for new homes. By computing the correlation coefficient, we can observe that the correlation between ‘New’ and ‘Size’ is 0.3843, indicating a weak relationship between the two variables.

Code
coefficients <- coef(model)
coefficients
(Intercept)        Size         New 
-40230.8668    116.1316  57736.2828 

C

Code
# Calculate the predicted selling price
size <- 3000
new <- 1
not_new <- 0

predicted_new <- coefficients[1] + coefficients[2] * size + coefficients[3] * new
predicted_not_new <- coefficients[1] + coefficients[2] * size + coefficients[3] * not_new

predicted_new
(Intercept) 
   365900.2 
Code
predicted_not_new
(Intercept) 
   308163.9 

if a house is new, the estimated selling price is $365,900.20. However, if the house is not new, the predicted selling price is $308,163.90.

D

Code
model_interaction <- lm(Price ~ Size * New, data = house.selling.price)
summary(model_interaction)

Call:
lm(formula = Price ~ Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

E

Code
coefficients_interaction <- coef(model_interaction)
coefficients_interaction
 (Intercept)         Size          New     Size:New 
-22227.80793    104.43839 -78527.50235     61.91588 
Code
library(ggplot2)

ggplot(data=house.selling.price,aes(x=Size,y=Price, color=New))+
  geom_point()+
  geom_smooth(method="lm",se=F)
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation: colour
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

The scatterplot shows a linear/correlative relationship between the variables, where an increase in size is associated with an increase in price. However, when considering the colors of the dots (which represent the age of the homes), the relationship is not as straightforward. The new houses (light blue dots) are dispersed throughout the graph, mostly along the slope line. On the other hand, the older residences (dots that are not light blue) are mainly clustered in the bottom-right corner of the graph, although there are a few that have a higher price/size than brand new homes.

F

Code
# Calculate the predicted selling price with interaction
predicted_new_interaction <- coefficients_interaction[1] + coefficients_interaction[2] * size + coefficients_interaction[3] * new + coefficients_interaction[4] * size * new
predicted_not_new_interaction <- coefficients_interaction[1] + coefficients_interaction[2] * size + coefficients_interaction[3] * not_new + coefficients_interaction[4] * size * not_new

predicted_new_interaction
(Intercept) 
   398307.5 
Code
predicted_not_new_interaction
(Intercept) 
   291087.4 

According to the given data, for a house with the specified measurements, the predicted selling price for a new home is $398,307.50, while for a not-new home, it is $291,087.40.

G

Code
# Calculate the predicted selling price for a home of 1500 square feet
size_1500 <- 1500

predicted_new_1500 <- coefficients_interaction[1] + coefficients_interaction[2] * size_1500 + coefficients_interaction[3] * new + coefficients_interaction[4] * size_1500 * new
predicted_not_new_1500 <- coefficients_interaction[1] + coefficients_interaction[2] * size_1500 + coefficients_interaction[3] * not_new + coefficients_interaction[4] * size_1500 * not_new

predicted_new_1500
(Intercept) 
   148776.1 
Code
predicted_not_new_1500
(Intercept) 
   134429.8 

The predicted selling price of a new 1500sqft home is $148,776.10, while the predicted selling price of a 1500sqft home that is not new is $134,429.80. These values are significantly lower than the expected selling prices in Part F, where the property size is doubled to 3000sqft. A new 3000sqft house is expected to sell for $398,307.50, while a new 1500sqft house is expected to sell for $148,694.70. The reduction in size and price by half shows a linear connection between the two variables.

In comparison to new dwellings, the price of not-new dwellings is more directly proportional to size. A 3000sqft house that is not new is expected to sell for $291,087.40, while a 1500sqft house that is not new is expected to sell for $134,429.80. The price difference between these two values is $156,657.6. Therefore, size has a more significant impact on the selling price of not-new homes than new ones.

H

Code
# Compare the adjusted R-squared values
summary(model)$adj.r.squared
[1] 0.7168767
Code
summary(model_interaction)$adj.r.squared
[1] 0.7363181
Code
# Compare residuals' distribution
par(mfrow = c(2, 1))
hist(residuals(model), main = "Residuals without Interaction", xlab = "Residuals", col = "darkmagenta")
hist(residuals(model_interaction), main = "Residuals with Interaction", xlab = "Residuals", col = "darkmagenta")

I think that a model without an interaction term provides a better fit for the relationship between ‘Size’ and ‘New’ regarding the outcome variable price. On the other hand, the model with the interaction term better captures the relationship between ‘Size’ and ‘Price’ rather than ‘Size’ and ‘New’.