Homework 4

Author

Saisrinivas Ambatipudi

Published

April 24, 2023

Question 1

A

library(alr4)
Loading required package: car
Loading required package: carData
Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
library(smss)

x1 <- 1240
x2 <- 18000
actual_price <- 145000

predicted_price <- -10536 + 53.8 * x1 + 2.84 * x2
residual <- actual_price - predicted_price
cat("A. Predicted selling price:", predicted_price, "\nResidual:", residual, "\n")
A. Predicted selling price: 107296 
Residual: 37704 

Given the variable information above (y=selling price, x1=house size, x2=lot size) and the data in the prediction equation (1,240 sqft house on an 18,000 sqft lot), the home’s estimated selling price is $107,296. Now compute the difference between this and the actual selling price ($145,000).

The difference in pricing indicates that the house sold for $37,740 more than the projected selling price.

B

home_size_coefficient <- 53.8
cat("B. House selling price predicted to increase for each square-foot increase in home size:", home_size_coefficient, "\n")
B. House selling price predicted to increase for each square-foot increase in home size: 53.8 

C

home_size_increase <- 1
lot_size_coefficient <- 2.84

lot_size_increase <- (home_size_coefficient * home_size_increase) / lot_size_coefficient
cat("C. Lot size increase needed to have the same impact as a one-square-foot increase in home size:", lot_size_increase, "\n")
C. Lot size increase needed to have the same impact as a one-square-foot increase in home size: 18.94366 

Question 2

A

library(alr4)
data(salary)

t_test <- t.test(salary ~ sex, data = salary)
t_test

    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14 

Without regard to school level or work rank, the average male pay is $24,696.79, while the average female wage is $21,357.14 - a $3,339.65 difference.

B

model <- lm(salary ~ ., data = salary)
conf_int <- confint(model, level = 0.95)
conf_int
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
sexFemale    -697.8183  3030.56452
year          285.1433   667.47476
ysdeg        -280.6397    31.49105

The pay gap between men and women is between -697.82 and 3030.57, according to a multiple linear regression with a 95% confidence range.

C

summary(model)

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

To summarize, wage increases are $1388.61 if the person has a PhD, $5292.36 if the person is an Associate Professor, $11,118.75 if the person is a full/tenured Professor, $1,166.37 if the individual is female, $476.31 for each year the individual remains at their present rank.

Salary, on the other hand, drops by $124.57 for each year after the individual obtained their maximum degree/rank level. Except for this one, all slopes are positive. Furthermore, an individual’s rank and the number of years spent at their present rank are statistically significant (less than 0.05).

D

# Change the baseline category for rank and rerun the model
salary$rank <- relevel(salary$rank, ref = "Asst")
model2 <- lm(salary ~ ., data = salary)
summary(model2)

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

None of the figured have changed.

E

# Exclude the rank variable and rerun the model
model3 <- lm(salary ~ degree + sex + year + ysdeg, data = salary)
summary(model3)

Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

Salary reduces by: when variable ‘rank’ is deleted: $3299.35 if the person holds a PhD, $1,286.54 if the person is female Salary gains, however, are $351.97 for each year spent at their present level, $339.40 for each year after earning their highest degree.

The slopes are divided 50/50 in terms of how many are positive and how many are negative. ‘degreePhD’, ‘year’, and ‘ysdeg’ are all statistically significant (less than 0.05), but’sexFemale’ is not.

F

# Create a new variable and run the multiple regression model
salary$new_dean <- ifelse(salary$ysdeg <= 15, "New", "Old")
model4 <- lm(salary ~ degree + sex + new_dean + year*ysdeg, data = salary)
summary(model4)

Call:
lm(formula = salary ~ degree + sex + new_dean + year * ysdeg, 
    data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8396.1 -2171.9  -352.5  2053.3 11061.3 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16855.764   1508.902  11.171 1.45e-14 ***
degreePhD   -3205.140   1431.791  -2.239   0.0302 *  
sexFemale   -1222.794   1345.288  -0.909   0.3682    
new_deanOld   550.409   2119.912   0.260   0.7963    
year          462.292    277.040   1.669   0.1021    
ysdeg         332.120    141.753   2.343   0.0236 *  
year:ysdeg     -4.525     10.083  -0.449   0.6557    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3814 on 45 degrees of freedom
Multiple R-squared:  0.6334,    Adjusted R-squared:  0.5845 
F-statistic: 12.96 on 6 and 45 DF,  p-value: 1.902e-08

‘Year’ and ‘ysdeg’ are likely to be highly correlated.

Question 3

A

library(smss)
data(house.selling.price)

model <- lm(Price ~ Size + New, data = house.selling.price)
summary(model)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

While controlling for size, predictor variables ‘New’ and ‘Size’ have p-values of 0.00257 and 2e-16 respectively. Both p-values are statistically significant as they are less than 0.05. This indicates that the null hypothesis can be rejected (there is no relationship between ‘New’ and ‘Price’ OR between ‘Size’ and ‘Price’ of new homes). By calculating the correlation, we can see that the correlation between ‘New’ and ‘Size’ is 0.3843, which is a wear relationship.

B

coefficients <- coef(model)
coefficients
(Intercept)        Size         New 
-40230.8668    116.1316  57736.2828 

C

# Calculate the predicted selling price
size <- 3000
new <- 1
not_new <- 0

predicted_new <- coefficients[1] + coefficients[2] * size + coefficients[3] * new
predicted_not_new <- coefficients[1] + coefficients[2] * size + coefficients[3] * not_new

predicted_new
(Intercept) 
   365900.2 
predicted_not_new
(Intercept) 
   308163.9 

If the house is new, the predicted selling price is $365,900.20 If the house isn’t new, the predicted selling price is $308,163.90

D

model_interaction <- lm(Price ~ Size * New, data = house.selling.price)
summary(model_interaction)

Call:
lm(formula = Price ~ Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

E

coefficients_interaction <- coef(model_interaction)
coefficients_interaction
 (Intercept)         Size          New     Size:New 
-22227.80793    104.43839 -78527.50235     61.91588 
library(ggplot2)

ggplot(data=house.selling.price,aes(x=Size,y=Price, color=New))+
  geom_point()+
  geom_smooth(method="lm",se=F)
`geom_smooth()` using formula = 'y ~ x'

The variables in the scatterplot have a linear/correlative connection, as seen in the graph, demonstrating that as size grows, so does the price. However, based on the colors of the dots (which correlate to the age of the home), the link isn’t as simple. New houses (light blue dots) are distributed across the graph, mostly along the slope line. The older residences (dots that aren’t light blue) are primarily clustered towards the bottom right corner of the graph, although there are a handful that outperform the price/size of brand new houses.

F

# Calculate the predicted selling price with interaction
predicted_new_interaction <- coefficients_interaction[1] + coefficients_interaction[2] * size + coefficients_interaction[3] * new + coefficients_interaction[4] * size * new
predicted_not_new_interaction <- coefficients_interaction[1] + coefficients_interaction[2] * size + coefficients_interaction[3] * not_new + coefficients_interaction[4] * size * not_new

predicted_new_interaction
(Intercept) 
   398307.5 
predicted_not_new_interaction
(Intercept) 
   291087.4 

The predicted selling price for a New home with the above measurements is $398,307.50.

The predicted selling price for a not-new home with the above measurements is $291,087.40.

G

# Calculate the predicted selling price for a home of 1500 square feet
size_1500 <- 1500

predicted_new_1500 <- coefficients_interaction[1] + coefficients_interaction[2] * size_1500 + coefficients_interaction[3] * new + coefficients_interaction[4] * size_1500 * new
predicted_not_new_1500 <- coefficients_interaction[1] + coefficients_interaction[2] * size_1500 + coefficients_interaction[3] * not_new + coefficients_interaction[4] * size_1500 * not_new

predicted_new_1500
(Intercept) 
   148776.1 
predicted_not_new_1500
(Intercept) 
   134429.8 

The predicted selling price of a new 1500sqft home is $148,776.10

The predicted selling price of a 1500sqft home that ISN’T new is $134,429.80

In comparison to the data in Part F (where the property size is doubled to 3000sqft), the expected selling prices in this section (G) are significantly lower. A new 3000sqft house is expected to sell for $398,307.50, while a new 1500sqft house is expected to sell for $148,694.70. The size and price both reduced by half, demonstrating that these two variables are connected and have a linear connection. A 3000sqft house that is NOT new is expected to sell for $291,087.40. A 1500sqft house that is not brand new is expected to sell for $134,429.80. The price difference between the two is $156,657.6. The price is more directly proportional to size than it is with new dwellings.

H

# Compare the adjusted R-squared values
summary(model)$adj.r.squared
[1] 0.7168767
summary(model_interaction)$adj.r.squared
[1] 0.7363181
# Compare residuals' distribution
par(mfrow = c(2, 1))
hist(residuals(model), main = "Residuals without Interaction", xlab = "Residuals", col = "lightblue")
hist(residuals(model_interaction), main = "Residuals with Interaction", xlab = "Residuals", col = "lightblue")

I believe that a model with no interaction better describes the relationship between ‘Size’ and ‘New’ with the outcome price; the model with interaction best represents the relationship between ‘Size’ and ‘Price’ rather than ‘Size’ and ‘New’.