lm(P ~ ., data = house.selling.price.2) |> summary()
Call:
lm(formula = P ~ ., data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-36.212 -9.546 1.277 9.406 71.953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.795 12.104 -3.453 0.000855 ***
S 64.761 5.630 11.504 < 2e-16 ***
Be -2.766 3.960 -0.698 0.486763
Ba 19.203 5.650 3.399 0.001019 **
New 18.984 3.873 4.902 4.3e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared: 0.8689, Adjusted R-squared: 0.8629
F-statistic: 145.8 on 4 and 88 DF, p-value: < 2.2e-16
A
Variable Be would be eliminated first under backward elimination because it has the largest p-value (0.487).
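As a sketch, this first step can be reproduced with base R's step(), which performs backward elimination by AIC rather than by p-values, so it is an approximate rather than exact check (output omitted):
Code
# Backward elimination by AIC, starting from the full model
step(lm(P ~ ., data = house.selling.price.2), direction = "backward")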
B
Variable S would be added first under forward selection because it has the smallest p-value.
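Forward selection can likewise be sketched with step(), starting from the intercept-only model (again AIC-based; output omitted):
Code
# Forward selection by AIC from the null model
null <- lm(P ~ 1, data = house.selling.price.2)
step(null, scope = ~ S + Be + Ba + New, direction = "forward")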
C
Beds has a high p-value despite a substantial correlation with price because of multicollinearity: Beds is strongly correlated with the other predictors, especially Size, so once Size, Baths, and New (which have much smaller p-values and are statistically significant) are in the model, Beds adds little independent information.
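One way to see this directly is to look at the pairwise correlations and the variance inflation factors. This is a sketch; vif() comes from the car package, which is assumed to be available here as a dependency of alr4:
Code
# Pairwise correlations among the variables: Be tracks S and Ba closely
cor(house.selling.price.2)
# Variance inflation factors for the full model (car::vif)
car::vif(lm(P ~ ., data = house.selling.price.2))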
D
Code
lm1 <- lm(P ~ ., data = house.selling.price.2)
lm2 <- lm(P ~ S + Be + Ba, data = house.selling.price.2)
lm3 <- lm(P ~ S + Be, data = house.selling.price.2)
lm4 <- lm(P ~ S, data = house.selling.price.2)
stargazer(lm1, lm2, lm3, lm4, type = "text")
The R^2 and adjusted R^2 for each model can be read from the stargazer table above.
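The same quantities can also be pulled straight from each model's summary (a sketch; output omitted):
Code
# R-squared and adjusted R-squared for each candidate model
sapply(list(lm1 = lm1, lm2 = lm2, lm3 = lm3, lm4 = lm4),
       function(m) c(R2 = summary(m)$r.squared, adj.R2 = summary(m)$adj.r.squared))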
Code
# PRESS: sum of squared leave-one-out prediction errors,
# computed from the residuals and hat values without refitting the model
PRESS <- function(model) {
  i <- residuals(model) / (1 - lm.influence(model)$hat)
  sum(i^2)
}
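Because PRESS equals the sum of squared leave-one-out prediction errors, the hat-value shortcut above can be sanity-checked by actually refitting the model n times. This is a slow but transparent sketch, with press_loo as a hypothetical helper name:
Code
# Leave-one-out check: refit without observation i, predict it, square the error
press_loo <- function(model, data) {
  resp <- all.vars(formula(model))[1]
  sum(sapply(seq_len(nrow(data)), function(i) {
    fit <- update(model, data = data[-i, ])
    (data[[resp]][i] - predict(fit, newdata = data[i, ]))^2
  }))
}
press_loo(lm1, house.selling.price.2)  # should match PRESS(lm1)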
Code
# lm1 PRESS
PRESS(lm1)
[1] 28390.22
Code
# lm1 AIC
AIC(lm1)
[1] 790.6225
Code
# lm1 BIC
BIC(lm1)
[1] 805.8181
Code
# lm2 PRESS
PRESS(lm2)
[1] 16270.99
Code
# lm2 AIC
AIC(lm2)
[1] 748.9785
Code
# lm2 BIC
BIC(lm2)
[1] 756.5763
Code
# lm3 PRESS
PRESS(lm3)
[1] 16131.45
Code
# lm3 AIC
AIC(lm3)
[1] 726.8739
Code
# lm3 BIC
BIC(lm3)
[1] 734.4717
Code
# lm4 PRESS
PRESS(lm4)
[1] 38203.29
Code
# lm4 AIC
AIC(lm4)
[1] 820.1439
Code
# lm4 BIC
BIC(lm4)
[1] 827.7417
E
Based on the criteria PRESS, R^2, adjusted R^2, AIC, and BIC, we go with Model 3 (lm3), since it has the lowest PRESS, AIC, and BIC values of the four models. Raw R^2 necessarily favors the full model, so adjusted R^2 is the fairer of the two fit measures for this comparison.
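Collecting all of the criteria into one table makes the comparison easier to scan (a sketch; output omitted):
Code
# One row per model: PRESS, AIC, BIC, and adjusted R-squared side by side
models <- list(lm1 = lm1, lm2 = lm2, lm3 = lm3, lm4 = lm4)
data.frame(PRESS  = sapply(models, PRESS),
           AIC    = sapply(models, AIC),
           BIC    = sapply(models, BIC),
           adj.R2 = sapply(models, function(m) summary(m)$adj.r.squared))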
Question 2
Code
data(trees)
head(trees)
A
Code
tree_reg <- lm(Volume ~ Girth + Height, data = trees)
summary(tree_reg)
B
There is a violation of the linearity assumption: the residuals-vs-fitted plot shows a clear curved pattern rather than a random scatter around zero.
Code
plot(tree_reg)
Question 3
A
Code
data(florida)
head(florida)
Code
gore <- lm(Buchanan ~ Gore, data = florida)
plot(gore)
From the diagnostic plots, Palm Beach is an outlier: Buchanan received far more votes in that county than the regression on Gore's vote count predicts.
B
Code
gore_log <- lm(log(Buchanan) ~ log(Gore), data = florida)
plot(gore_log)
Even after applying the log() transformation to both variables, nothing fundamentally changes when it comes to outliers: Palm Beach still stands out in the diagnostic plots.
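That claim can also be checked numerically rather than visually. A sketch, using the gore_log fit from above and assuming the florida data's row names identify the counties:
Code
# Counties with the largest absolute standardized residuals on the log-log fit
head(sort(abs(rstandard(gore_log)), decreasing = TRUE))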