Question 1
Part A
The first variable to be deleted would be beds because it has the largest p-value, and backward elimination begins with deleting the variable with the largest p-value.
Part B
The first variable added in forward selection would be size because it has the smallest p-value.
Part C
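Beds has such a large p-value despite its correlation with price because it is also strongly correlated with the other predictors (multicollinearity). Once those predictors are in the model, Beds adds little additional information about price, and its standard error is inflated.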
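One way to check a multicollinearity explanation like this is to compute variance inflation factors. The sketch below is only an illustration: `vif_manual` is a hypothetical helper (not from the original analysis), and it uses the built-in `mtcars` data as a stand-in because the house price data is not loaded here.

```r
# Hypothetical helper: VIF for each predictor of a fitted lm,
# computed as 1 / (1 - R^2) from regressing that predictor on the others.
vif_manual <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Illustration on built-in data (an assumption, not the house price data)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
vifs <- vif_manual(fit)
round(vifs, 2)

# Pairwise correlations among the predictors, for comparison
cor(mtcars[, c("wt", "hp", "disp")])
```

A predictor with a large VIF is well explained by the other predictors, which is the situation described for Beds.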
Part D
# Model 1: backward elimination from the full model
# Requires house.selling.price.2 to be loaded in the session first.
model <- lm(P ~ ., data = house.selling.price.2)
model1 <- step(model)
summary(model1)
# Model 2: step() with direction = "forward", started from the full model
model <- lm(P ~ ., data = house.selling.price.2)
model2 <- step(model, direction = "forward")
summary(model2)
Based on R-squared, we would pick Model 2 since it has the higher R-squared. This is expected, because R-squared never decreases when a predictor is added, and Model 2 includes all four predictor variables; it is the model returned by step() with direction = "forward". However, judging by the Adjusted R-squared criterion, we would pick Model 1 since it has the higher Adjusted R-squared. Model 1 was chosen using backward elimination and does not include the Beds variable.
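A note on how `step()` treats the forward direction, sketched on the built-in `mtcars` data as an assumed stand-in for the house price data. When no `scope` is given, the starting model is the upper limit of the search, so a forward search started from the full model has nothing left to add and returns that model unchanged. Forward selection in the usual sense starts from the intercept-only model and supplies the full formula as `scope`.

```r
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Forward selection: start from the intercept-only model and let step()
# add terms up to the full model's formula.
fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Backward elimination: start from the full model and drop terms.
bwd <- step(full, direction = "backward", trace = 0)

# Forward from the full model with no scope: nothing can be added.
same <- step(full, direction = "forward", trace = 0)

formula(fwd)
formula(bwd)
formula(same)

# R-squared and adjusted R-squared for a selected model
c(r2 = summary(fwd)$r.squared, adj_r2 = summary(fwd)$adj.r.squared)
```

The two directions need not agree, so it is worth printing both selected formulas before comparing criteria.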
For the PRESS criterion, the model we would pick is Model 1, which was found using backward elimination and includes only New, Bath, and Size as predictor variables.
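For reference, PRESS can be computed without refitting the model n times, using the identity that each leave-one-out residual equals the ordinary residual divided by one minus its leverage. The helper name `press` below is hypothetical, and the built-in `trees` data is used only as an illustration.

```r
# Hypothetical helper: PRESS = sum of squared leave-one-out prediction errors.
# For a linear model each leave-one-out residual is e_i / (1 - h_ii).
press <- function(fit) {
  sum((residuals(fit) / (1 - hatvalues(fit)))^2)
}

# Illustration on built-in data (an assumption, not the house price models)
fit_a <- lm(Volume ~ Girth + Height, data = trees)
fit_b <- lm(Volume ~ Girth, data = trees)
c(press_a = press(fit_a), press_b = press(fit_b))
```

A lower PRESS indicates better out-of-sample prediction, which is why it can favor a smaller model even when R-squared favors the larger one.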
AIC(model1)
AIC(model2)
Judging by the AIC, Model 1 is a better fit because it has a lower AIC.
BIC(model1)
BIC(model2)
Model 1 is also a better fit by the BIC criterion because it has a lower BIC. Overall, judging by all five criteria, Model 1 would be the best fit since it has better results on four of the five.
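The five criteria can be gathered side by side in one table, which makes a "four out of five" comparison easier to read. This is a minimal sketch: `compare_models` and `press` are hypothetical helpers, and the two `trees` models are stand-ins for Model 1 and Model 2.

```r
# Hypothetical helper: PRESS via leave-one-out residuals e_i / (1 - h_ii)
press <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Hypothetical helper: one row of selection criteria per fitted model.
compare_models <- function(...) {
  fits <- list(...)
  data.frame(
    model  = names(fits),
    r2     = sapply(fits, function(f) summary(f)$r.squared),
    adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
    press  = sapply(fits, press),
    aic    = sapply(fits, AIC),
    bic    = sapply(fits, BIC),
    row.names = NULL
  )
}

# Illustration on built-in data (an assumption, not the house price models)
crit <- compare_models(
  girth_only   = lm(Volume ~ Girth, data = trees),
  girth_height = lm(Volume ~ Girth + Height, data = trees)
)
crit
```

Higher is better for R-squared and adjusted R-squared; lower is better for PRESS, AIC, and BIC.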
Part E
As stated before, Model 1 has better results in four out of the five criteria (Adjusted R-squared, PRESS, AIC, and BIC). Thus, the model I would prefer overall is Model 1, which omits the Beds variable. Beds also had an extremely high p-value compared to the other variables, so it makes sense to construct a model without it.
Question 2
Part A
model <- lm(Volume ~ Girth + Height, data = trees)
summary(model)
Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max
-6.4065 -2.6493 -0.2876  2.2003  8.4847

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
Part B
The diagnostic plots suggest that some regression assumptions are violated. In the Residuals vs. Fitted plot, the smoothed line is curved rather than flat, which suggests that the relationship between Volume and the predictors is not linear. The Scale-Location plot also shows a line that is not flat, indicating that the assumption of constant error variance is violated. The other two plots, Normal Q-Q and Residuals vs. Leverage, show no clear problems.
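The plots discussed here can be reproduced with the default `plot()` method for `lm` objects. The sketch below also adds one rough numeric check of curvature, which is an addition for illustration and not part of the original answer: it tests whether a squared Girth term improves the fit.

```r
model <- lm(Volume ~ Girth + Height, data = trees)

# Draw the four default diagnostic plots in a 2 x 2 grid.
old_par <- par(mfrow = c(2, 2))
plot(model)
par(old_par)

# Rough numeric check of curvature (illustrative assumption):
# does adding a squared Girth term improve the fit?
model_sq <- update(model, . ~ . + I(Girth^2))
anova(model, model_sq)
```

A small p-value in the `anova()` comparison would be consistent with the curved pattern in the Residuals vs. Fitted plot.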
Question 3
Part A
model <- lm(Buchanan ~ Bush, data = florida)
plot(model)
Palm Beach is an outlier based on the diagnostic plots for the model: all the other data points are fairly close together, while the Palm Beach point lies far from them in each plot. Additionally, in the Residuals vs. Leverage plot, the Palm Beach point falls outside the Cook’s distance contours, meaning it is an outlier with extreme influence on the fitted model.
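Cook's distance can also be inspected numerically instead of reading it off the plot. The sketch below uses simulated data with one planted unusual observation, since the florida data is not loaded here; the variable names, the simulated values, and the 4/n cutoff (a common rule of thumb) are all assumptions for illustration.

```r
set.seed(1)

# Simulated stand-in for county vote counts (NOT the florida data)
x <- runif(60, 1000, 300000)
y <- 50 + 0.004 * x + rnorm(60, sd = 100)
y[60] <- y[60] + 2500            # plant one unusual observation
sim <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = sim)
cd  <- cooks.distance(fit)

which.max(cd)                    # observation with the largest Cook's distance
cd[cd > 4 / nrow(sim)]           # observations above a rule-of-thumb cutoff
plot(fit, which = 5)             # Residuals vs Leverage with Cook's contours
```

Applied to the real model, the same calls would show whether Palm Beach has the largest Cook's distance and by how much.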
Part B
model <- lm(log(Buchanan) ~ log(Bush), data = florida)
plot(model)
The findings change somewhat. In the new model using logs, the Palm Beach point now falls inside the Cook’s distance contours, meaning it has less influence on the fitted model. The distance between Palm Beach and the other data points is reduced, so it remains somewhat of an outlier, but much less so than in the previous model.
