---
title: "Blog Post #5"
author: "Alexis Gamez"
description: "DACSS 603 HW#5"
date: "05/13/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw5
- correlation
  - regression analysis
- diagnostics
---
```{r, warning = F, message = F}
# setup
library(tidyverse)
library(alr4)
library(smss)
library(stargazer)
library(MPV)
knitr::opts_chunk$set(echo = TRUE)
```
# Question 1
```{r}
# loading data
data("house.selling.price.2")
```
*For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.*
## A)
**For backward elimination, which variable would be deleted first? Why?**
Using backward elimination, I would delete the `Beds` variable first. Backward elimination starts from a model containing all candidate predictors and removes, one at a time, the least significant variable until every remaining predictor is significant. In this case, `Beds` is the least significant predictor in the full model, with a p-value of 0.487.
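As a sketch, this procedure can be automated with base R's `step()`. Note that `step()` removes terms by AIC rather than by p-value, so it is an approximation of the rule described above, though here it also drops `Be` first:

```{r}
# sketch: automated backward elimination with step(), which removes the
# term whose deletion most improves AIC (a proxy for the p-value rule)
full <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
step(full, direction = "backward", trace = 1)
```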
## B)
**For forward selection, which variable would be added first? Why?**
Forward selection works in the opposite direction: we start with an empty (intercept-only) model and add variables one at a time, each time choosing the most significant remaining candidate. Under these circumstances, I would add the `Size` variable first, as it is the most significant predictor, with a p-value that is effectively 0.
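The mirrored procedure can be sketched with `step()` as well, starting from the null model (again by AIC rather than p-values, as an approximation):

```{r}
# sketch: forward selection with step(), starting from the intercept-only
# model and adding, at each step, the candidate that most improves AIC
null <- lm(P ~ 1, data = house.selling.price.2)
step(null, scope = ~ S + Be + Ba + New, direction = "forward", trace = 1)
```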
## C)
**Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?**
While `Beds` has a substantial marginal correlation with `Price`, it is also correlated with the other predictors, most notably `Size`: larger houses tend to have more bedrooms. Once `Size`, `Baths`, and `New` are in the model, `Beds` carries little additional information about `Price`, so its coefficient is estimated imprecisely and its p-value is large. In other words, the effect of `Beds` on `Price` is largely absorbed by the other variables already in the model.
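The correlation matrix makes this visible: `Be` correlates with `P`, but also with the other predictors, so little of its information is unique.

```{r}
# Be is correlated with P, but also with the other predictors (notably S),
# so it adds little once they are in the model
round(cor(house.selling.price.2), 2)
```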
## D)
**Using software with these four predictors, find the model that would be selected using each criterion:**
```{r}
summary(house.selling.price.2)
```
For this question, I'll use `stargazer` to display the four candidate fits side by side; the resulting table answers parts 1 & 2.
```{r}
fit1 <- (lm(P ~ S, data= house.selling.price.2))
fit2 <- (lm(P ~ S + New, data= house.selling.price.2))
fit3 <- (lm(P ~ S + Ba + New, data= house.selling.price.2))
fit4 <- (lm(P ~ S + Be + Ba + New, data= house.selling.price.2))
stargazer(fit1, fit2, fit3, fit4, type = 'text')
```
### 1. R^2
The best model according to this criterion would be model 4, as its R^2 value is the highest (0.869). Note, though, that R^2 never decreases when a predictor is added, so this criterion always favors the largest model.
### 2. Adjusted R^2
The best model according to this criterion would be model 3, as its adjusted R^2 value is the highest (0.864).
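For reference, adjusted R^2 applies a degrees-of-freedom penalty to R^2. Recomputing it from the table's R^2 values (n = 93 observations, k predictors) reproduces the stargazer figures of 0.864 and 0.863:

```{r}
# adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
n <- 93
c(model3 = 1 - (1 - 0.868) * (n - 1) / (n - 3 - 1),  # k = 3 predictors
  model4 = 1 - (1 - 0.869) * (n - 1) / (n - 4 - 1))  # k = 4 predictors
```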
### 3. PRESS
Model 1
```{r}
PRESS(fit1)
```
Model 2
```{r}
PRESS(fit2)
```
Model 3
```{r}
PRESS(fit3)
```
Model 4
```{r}
PRESS(fit4)
```
Since a lower PRESS value indicates better out-of-sample predictive performance, I would select model 3 according to this criterion.
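For transparency, `MPV::PRESS()` can be reproduced by hand: the leave-one-out residual for observation i equals the ordinary residual divided by (1 - h_ii), so no refitting is needed.

```{r}
# PRESS = sum of squared leave-one-out residuals, computed from the
# ordinary residuals and the hat (leverage) values
press_by_hand <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)
press_by_hand(lm(P ~ S + Ba + New, data = house.selling.price.2))
```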
### 4. AIC
Model 1
```{r}
AIC(fit1)
```
Model 2
```{r}
AIC(fit2)
```
Model 3
```{r}
AIC(fit3)
```
Model 4
```{r}
AIC(fit4)
```
As with the PRESS criterion, the lower the AIC value, the better the model fits after accounting for model complexity. Again, under these circumstances I would select model 3 as the best fit.
### 5. BIC
Model 1
```{r}
BIC(fit1)
```
Model 2
```{r}
BIC(fit2)
```
Model 3
```{r}
BIC(fit3)
```
Model 4
```{r}
BIC(fit4)
```
As with AIC, the lowest BIC value indicates the best-fitting model, with BIC penalizing additional parameters more heavily. Once again, model 3 would be the best fit under this criterion.
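The per-model chunks above can also be collected into a single comparison table; a compact sketch:

```{r}
# gather AIC and BIC for all four candidate models in one data frame
mods <- list(
  fit1 = lm(P ~ S, data = house.selling.price.2),
  fit2 = lm(P ~ S + New, data = house.selling.price.2),
  fit3 = lm(P ~ S + Ba + New, data = house.selling.price.2),
  fit4 = lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
)
data.frame(AIC = sapply(mods, AIC), BIC = sapply(mods, BIC))
```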
## E)
**Explain which model you prefer and why.**
Models 3 & 4 are the only ones favored by any of the criteria above. Even then, model 4 wins only on raw R^2, which always favors the larger model; adjusted R^2 is the more meaningful comparison. Thus, between these two models, I would select model 3 as the best fit, especially since PRESS, AIC, and BIC all point to it as well.
# Question 2
```{r}
# loading data
data("trees")
```
**Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,**
## A)
**Fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables.**
```{r}
summary(trees)
```
With the data loaded, I'll fit the model below:
```{r}
fit_t <- lm(Volume ~ Girth + Height, data = trees)
summary(fit_t)
```
## B)
**Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?**
```{r}
par(mfrow = c(2, 3)); plot(fit_t, which = 1:6)
```
Immediately, it's apparent that at least a couple of regression assumptions are violated across the 6 diagnostic plots. Most noteworthy are the Residuals vs Fitted, Scale-Location, and Cook's Distance plots. The curved pattern in the first plot indicates that the linearity assumption is violated. Similarly, the lack of a steady horizontal trend in the Scale-Location plot suggests a violation of the constant-variance assumption. Lastly, the Cook's distance plot shows that the 31st observation, which is also an extreme outlier, exerts far more influence on the fitted coefficients than any other observation, so the fit is being pulled disproportionately by a single tree.
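The visual impression about observation 31 can be checked numerically. A common rule of thumb (one of several in use) flags observations whose Cook's distance exceeds 4/n:

```{r}
# observations whose Cook's distance exceeds the 4/n rule of thumb
cd <- cooks.distance(lm(Volume ~ Girth + Height, data = trees))
which(cd > 4 / length(cd))
```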
# Question 3
```{r}
# loading data
data("florida")
```
**In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.**
## A)
**Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?**
```{r}
summary(florida)
```
With the data loaded, I've fit & visualized the requested model below:
```{r}
fit_f <- lm(Buchanan ~ Bush, data = florida)
par(mfrow = c(2, 3)); plot(fit_f, which = 1:6)
```
With the model fit and the diagnostic plots created, it's extremely apparent that Palm Beach is indeed an outlier, and an extreme one at that. In every plot, Palm Beach deviates entirely from the trend followed by the other counties. For example, in the Residuals vs Fitted plot, the linearity assumption looks relatively sound until one takes Palm Beach and Dade into consideration. Both depart blatantly from that assumption, and would lead any reasonable analyst to conclude that something unusual affected the vote in those counties, consistent with the claim that the butterfly ballot inflated Buchanan's count.
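The outlier claim can also be backed with standardized residuals, where Palm Beach's value sits far outside the usual ±2 band:

```{r}
# the three largest absolute standardized residuals; Palm Beach dominates
fit_f <- lm(Buchanan ~ Bush, data = florida)
head(sort(abs(rstandard(fit_f)), decreasing = TRUE), 3)
```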
## B)
**Take the log of both variables (Bush vote and Buchanan vote) and repeat the analysis in (A). Do your findings change?**
```{r}
fit_logf <- lm(log(Buchanan) ~ log(Bush), data = florida)
par(mfrow = c(2, 3)); plot(fit_logf, which = 1:6)
```
While log-transforming both the `Buchanan` & `Bush` variables lessens the impact of the Palm Beach observation on the model, the change is not large enough to alter my findings: I would still consider Palm Beach an outlier.