Code
library(tidyverse)
library(smss)
library(alr4)
library(ggplot2)
Question 1
A. For backward elimination, which variable would be deleted first? Why?
Beds would be deleted first because it has the largest p-value of the four predictors.
B. For forward selection, which variable would be added first? Why?
Size would be added first because it has the smallest p-value.
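The first step of each procedure in parts A and B can be sketched with `drop1()` and `add1()`. This is only an illustration: it uses the built-in `mtcars` data as a stand-in, not the house-price data from the question, and the choice of predictors is an assumption made for the example.

```r
# Stand-in data: built-in mtcars, NOT the house-price data from the question.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Backward elimination, first step: F-test for dropping each term from the full model.
back <- drop1(full, test = "F")
p_back <- setNames(back[["Pr(>F)"]], rownames(back))
first_out <- names(which.max(p_back))  # largest p-value; the NA in the "<none>" row is ignored

# Forward selection, first step: F-test for adding each term to the intercept-only model.
fwd <- add1(null, scope = ~ wt + hp + qsec + drat, test = "F")
p_fwd <- setNames(fwd[["Pr(>F)"]], rownames(fwd))
first_in <- names(which.min(p_fwd))    # smallest p-value

cat("Dropped first:", first_out, "\n")
cat("Added first:  ", first_in, "\n")
```

For single-coefficient terms, the F-test p-values from `drop1()` match the t-test p-values in `summary(full)`, so reading the largest p-value off the summary table gives the same answer.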
C. Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?
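I believe Beds is highly correlated with Size, which causes multicollinearity. Once Size is in the model, Beds adds little further information about Price, so its standard error is inflated and its p-value is large.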
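A small simulation can illustrate the multicollinearity argument in part C. Everything here is assumed for illustration: the variable names mirror the question, but the data are simulated so that price depends on size only while beds is strongly tied to size.

```r
# Simulated data (hypothetical values, not the house-price data set).
set.seed(1)
n     <- 100
size  <- rnorm(n, mean = 2000, sd = 500)
beds  <- 1 + size / 700 + rnorm(n, sd = 0.3)   # beds is mostly determined by size
price <- 50 + 0.1 * size + rnorm(n, sd = 20)   # price depends on size only

# Marginal correlation of beds with price is substantial ...
r_beds_price <- cor(beds, price)

# ... yet in the joint model beds competes with size for the same information.
fit <- lm(price ~ size + beds)
print(round(coef(summary(fit)), 4))

# Variance inflation factor for beds: 1 / (1 - R^2) from regressing beds on size.
r2_beds  <- summary(lm(beds ~ size))$r.squared
vif_beds <- 1 / (1 - r2_beds)
cat("cor(beds, price):", round(r_beds_price, 3), "\n")
cat("VIF for beds:    ", round(vif_beds, 2), "\n")
```

A VIF well above 1 means the standard error of the beds coefficient is inflated relative to what it would be if beds were uncorrelated with size.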
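D. Using software with these four predictors, find the model that would be selected using each criterion:
1. R²
2. Adjusted R²
3. PRESS
4. AIC
5. BIC
Code
data("house.selling.price.2")
model_house <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
model_house_no_Be <- lm(P ~ S + Ba + New, data = house.selling.price.2)
# PRESS
pr <- resid(model_house) / (1 - lm.influence(model_house)$hat)
sum(pr^2)
pr <- resid(model_house_no_Be) / (1 - lm.influence(model_house_no_Be)$hat)
sum(pr^2)
# R², adjusted R², AIC, BIC
broom::glance(model_house)
broom::glance(model_house_no_Be)
R²: model_house is 0.87; model_house_no_Be is 0.87.
Adjusted R²: model_house is 0.86; model_house_no_Be is 0.86.
PRESS: model_house is 28390.22; model_house_no_Be is 27860.05.
AIC: model_house is 790.6225; model_house_no_Be is 789.1366.
BIC: model_house is 805.8181; model_house_no_Be is 801.7996.
E. Explain which model you prefer and why.
I prefer model_house_no_Be because its AIC, BIC, and PRESS are all smaller than those of model_house, while R² and adjusted R² are the same to two decimal places.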
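The selection criteria named in part D (R², adjusted R², PRESS, AIC, BIC) can be collected for any pair of candidate `lm` fits. The sketch below assumes the built-in `mtcars` data as a stand-in; `press()` and `criteria()` are hypothetical helper names introduced for this example.

```r
# Stand-in candidate models on built-in mtcars (not the house-price data).
fit_full  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
fit_small <- lm(mpg ~ wt + hp + drat, data = mtcars)

# PRESS: sum of squared leave-one-out residuals, e_i / (1 - h_ii).
press <- function(fit) {
  sum((resid(fit) / (1 - hatvalues(fit)))^2)
}

# Gather all five criteria for one fitted model.
criteria <- function(fit) {
  s <- summary(fit)
  c(R2    = s$r.squared,
    adjR2 = s$adj.r.squared,
    PRESS = press(fit),
    AIC   = AIC(fit),
    BIC   = BIC(fit))
}

comparison <- rbind(full = criteria(fit_full), small = criteria(fit_small))
print(round(comparison, 3))
```

Larger is better for R² and adjusted R²; smaller is better for PRESS, AIC, and BIC. R² can never decrease when a predictor is added, so on its own it always favors the larger model.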
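Question 2
Code
data("trees")
head(trees)
A. Fit a multiple regression model with Volume as the outcome and Girth and Height as the explanatory variables.
Code
trees_model_1 <- lm(Volume ~ Girth + Height, data = trees)
summary(trees_model_1)
Call:
lm(formula = Volume ~ Girth + Height, data = trees)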
Residuals:
    Min      1Q  Median      3Q     Max
-6.4065 -2.6493 -0.2876  2.2003  8.4847
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
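The summary above can be reproduced and read programmatically, since `trees` ships with base R. The sketch below refits the same model and uses it for one prediction; the tree with Girth 12 and Height 75 is a made-up input chosen only to show the arithmetic.

```r
# Refit the model from the output above (trees is a built-in data set).
trees_model_1 <- lm(Volume ~ Girth + Height, data = trees)

est       <- coef(trees_model_1)               # intercept and slopes
r2        <- summary(trees_model_1)$r.squared  # Multiple R-squared
sigma_hat <- sigma(trees_model_1)              # residual standard error

# Prediction for a hypothetical tree: intercept + slope_G * 12 + slope_H * 75.
new_tree <- data.frame(Girth = 12, Height = 75)
pred <- unname(predict(trees_model_1, newdata = new_tree))

print(round(est, 4))
cat("R-squared:", round(r2, 3), " sigma:", round(sigma_hat, 3), "\n")
cat("Predicted volume:", round(pred, 2), "\n")  # roughly 24, from the rounded coefficients
```

By hand, the prediction is about -57.9877 + 4.7082 × 12 + 0.3393 × 75, which comes to roughly 24.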
B. Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?
Code
par(mfrow = c(2, 3))
plot(trees_model_1, which = 1:6)
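Several plots show clear patterns. The smoothed line in the Residuals vs Fitted plot is curved, which suggests the linearity assumption is violated. The same shape appears in the Scale-Location plot, which suggests the variance of the residuals is not constant. In the Normal Q-Q plot, the points largely do not fall along the line, which calls the normality of the residuals into question.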
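One way to back up the visual impression of curvature is to add a squared Girth term and compare the two nested fits with an F-test. This is an assumed follow-up check, not part of the original homework, and other remedies (for example a log transformation) would be equally reasonable.

```r
# Original linear fit and a variant with a quadratic Girth term.
linear_fit <- lm(Volume ~ Girth + Height, data = trees)
curved_fit <- lm(Volume ~ Girth + I(Girth^2) + Height, data = trees)

# Nested-model F-test: does the squared term reduce the residual sum of squares
# by more than chance would explain?
cmp    <- anova(linear_fit, curved_fit)
p_curv <- cmp[["Pr(>F)"]][2]

print(cmp)
cat("p-value for the quadratic term:", format.pval(p_curv), "\n")
```

A small p-value here would agree with the curved pattern in the Residuals vs Fitted plot; a large one would suggest the curvature is within the range of noise.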
Question 3
Code
data("florida")
head(florida)
           Gore   Bush Buchanan
ALACHUA   47300  34062      262
BAKER      2392   5610       73
BAY       18850  38637      248
BRADFORD   3072   5413       65
BREVARD   97318 115185      570
BROWARD  386518 177279      789
A. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?
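Code
votes_2000 <- lm(Buchanan ~ Bush, data = florida)
summary(votes_2000)
par(mfrow = c(2, 3))
plot(votes_2000, which = 1:6)
According to the Cook's distance plot, Palm Beach County is an outlier.
B. Take the log of both variables (Bush vote and Buchanan vote) and repeat the analysis in (A). Do your findings change?
Code
votes_2000_trans <- lm(log(Buchanan) ~ log(Bush), data = florida)
par(mfrow = c(2, 3))
plot(votes_2000_trans, which = 1:6)
According to the Cook's distance plot for the new model, although it seems to flag more outliers, Palm Beach County is still a notable one.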
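The Cook's distance reading can also be made numerically with `cooks.distance()`. The sketch below uses simulated data with one planted outlier rather than the `florida` data, and the 4/n cutoff is a common rule of thumb assumed here, not a threshold taken from the homework.

```r
# Simulated data (hypothetical): a clean linear trend plus one planted outlier.
set.seed(42)
n <- 50
x <- seq_len(n)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)
y[n] <- y[n] + 40            # observation 50 is pushed far off the line

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)   # influence of each observation on the fitted values

# Flag observations above the 4/n rule of thumb and find the most influential one.
flagged          <- which(cd > 4 / n)
most_influential <- unname(which.max(cd))

cat("Flagged observations:", flagged, "\n")
cat("Most influential:    ", most_influential, "\n")
```

Applied to the models above, the same calls on `votes_2000` and `votes_2000_trans` would show whether the Palm Beach row has the largest Cook's distance before and after the log transformation.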
Source Code
---
title: "Homework 5"
author: "Guanhua Tan"
description: "Homework 5"
date: "05/07/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw5
  - regression analysis
---

```{r}
library(tidyverse)
library(smss)
library(alr4)
library(ggplot2)
```

# Question 1

A. For backward elimination, which variable would be deleted first? Why?

Beds would be deleted first because it has the largest p-value of the four predictors.

B. For forward selection, which variable would be added first? Why?

Size would be added first because it has the smallest p-value.

C. Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

I believe Beds is highly correlated with Size, which causes multicollinearity.

D. Using software with these four predictors, find the model that would be selected using each criterion:

1. R²
2. Adjusted R²
3. PRESS
4. AIC
5. BIC

```{r}
data("house.selling.price.2")
model_house <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
model_house_no_Be <- lm(P ~ S + Ba + New, data = house.selling.price.2)
# PRESS
pr <- resid(model_house) / (1 - lm.influence(model_house)$hat)
sum(pr^2)
pr <- resid(model_house_no_Be) / (1 - lm.influence(model_house_no_Be)$hat)
sum(pr^2)
# R², adjusted R², AIC, BIC
broom::glance(model_house)
broom::glance(model_house_no_Be)
```

R²: model_house is 0.87; model_house_no_Be is 0.87.

Adjusted R²: model_house is 0.86; model_house_no_Be is 0.86.

PRESS: model_house is 28390.22; model_house_no_Be is 27860.05.

AIC: model_house is 790.6225; model_house_no_Be is 789.1366.

BIC: model_house is 805.8181; model_house_no_Be is 801.7996.

E. Explain which model you prefer and why.

I prefer model_house_no_Be because its AIC, BIC, and PRESS are all smaller than those of model_house.

# Question 2

```{r}
data("trees")
head(trees)
```

A. Fit a multiple regression model with Volume as the outcome and Girth and Height as the explanatory variables.

```{r}
trees_model_1 <- lm(Volume ~ Girth + Height, data = trees)
summary(trees_model_1)
```

B. Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

```{r}
par(mfrow = c(2, 3))
plot(trees_model_1, which = 1:6)
```

Several plots show clear patterns. The smoothed line in the Residuals vs Fitted plot is curved. The same shape appears in the Scale-Location plot. In the Normal Q-Q plot, the points largely do not fall along the line.

# Question 3

```{r}
data("florida")
head(florida)
```

A. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

```{r}
votes_2000 <- lm(Buchanan ~ Bush, data = florida)
summary(votes_2000)
par(mfrow = c(2, 3))
plot(votes_2000, which = 1:6)
```

According to the Cook's distance plot, Palm Beach County is an outlier.

B. Take the log of both variables (Bush vote and Buchanan vote) and repeat the analysis in (A). Do your findings change?

```{r}
votes_2000_trans <- lm(log(Buchanan) ~ log(Bush), data = florida)
par(mfrow = c(2, 3))
plot(votes_2000_trans, which = 1:6)
```

According to the Cook's distance plot for the new model, although it seems to flag more outliers, Palm Beach County is still a notable one.