Homework 5

hw4

Author

Ethan Campbell

Published

December 8, 2022

Homework 5 DACSS 603, Fall 2022 Due: December 9, 2022, 11.59pm

Please consult the relevant tutorials if you’re having trouble with coding the answers. Please write up your solutions as a .qmd (Quarto) document and publish in the Course Blog.

Some of the questions use data from the alr4 and smss R packages. You would need to install those packages in R (no need for an install.packages() call in your .qmd file, though—just use library()) and load the data using the data() function.

Question 1

(Data file: house.selling.price.2 from smss R package)

For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.

Hint 1: You should be able to answer A, B, C just using the tables below, although you should feel free to load the data in R and work with it if you so choose. They will be consistent with what you see on the tables.

Hint 2: The p-value of a variable in a simple linear regression is the same p-value one would get from a Pearson’s correlation (cor.test). The p-value is a function of the magnitude of the correlation coefficient (the higher the coefficient, the lower the p-value) and of sample size (larger samples lead to smaller p-values). For the correlations shown in the tables, they are between variables of the same length.)

With these four predictors,

1.1

For backward elimination, which variable would be deleted first? Why?

For backward elimination we remove the one with the largest p values which would be beds as its p value is .487.

1.2

For forward selection, which variable would be added first? Why?

For forward selection we add the ones with the lowest p-value which would be size or new. However, while they have the same p-value size has a higher correlation based on the tabes above I would add size first then add new followed by bath.

1.3

Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

I believe this is not showing a sub .05 p-value because either the model is too complex and considering the other variables it throughs off the bed variable. This is also a fairly small sample size of ~90 so we can also say maybe there is not enough infomration to accurately describe this effect.

1.4

Using software with these four predictors, find the model that would be selected using each criterion:

R2 - Multiple linear regression model using lm function followed by summary. This will show the r-squared in the summary so it is really easy to find this information which is nice. Adjusted R2 - this will be the same as R2. This will show the adjusted r-squared for the regression in the summary so it is really easy to find this infomration. PRESS - utilizing the MPV package we can use the function press which simiply needs the linear model you used. This is much shorter than writing it all out and much easier. AIC - for this we can use the AIC function on the model that we have created before. BIC - for this we can use the BIC function on the model that we have created before.

Code

library(smss)
library(MPV)

Loading required package: lattice

Loading required package: KernSmooth

KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009

Code

data("house.selling.price.2")
mod1 <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
summary(mod1)


Call:
lm(formula = P ~ S + Be + Ba + New, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16

Code

AIC(mod1)

[1] 790.6225

Code

BIC(mod1)

[1] 805.8181

Code

PRESS(mod1)

[1] 28390.22

Explain

Explain which model you prefer and why.

I usually go off the R2 for simple models as there is not too much to take into consideration but as it gets more complex I believe there are certain factors that need to be taken into consideration. When the equation gets more complex I would prefer using the AIC as it takes into consideration the other models this means that they are relative. This means that when one model gets to complex it can get penalized and not reflect a better score so this helps prevent over-complication of models.

Question 2

(Data file: trees from base R) From the documentation: “This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labeled Girth in the data. It is measured at 4 ft 6 in above the ground.”

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,

2.1

fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

After running the regression modeling and plotting to look at the data we can see there is definatly something wrong with this infomation and it seems to break the regression assumptions. It appears that linearity is a problem as the residuals vs fitted plot does not show a linear line and is very curved. This is due to the lack of constant variance resulting in the line being curved. The scale-location plot also looks like it has been violated as it is also not linear and shows a very strong dip then positive line.

Code

data("trees")

tree <- lm(Volume ~ Girth + Height, data = trees)

summary(tree)


Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

Code

plot(tree)

Question 3

(Data file: florida in alr R package)

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.

The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan.

Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not? Take the log of both variables (Bush vote and Buchanan Vote) and repeat the analysis in (a). Does your findings change?

Code

library(alr4)

Loading required package: car

Loading required package: carData

Loading required package: effects

Use the command
    lattice::trellis.par.set(effectsTheme())
  to customize lattice options for effects plots.
See ?effectsTheme for details.

Code

data("florida")

Buchanan <- lm(Buchanan ~ Bush, data=florida)
mod <- lm(log(Buchanan) ~ log(Bush), data = florida)

par(mfrow = c(2,3)); plot(Buchanan, which = 1:6)

Code

par(mfrow = c(2,3)); plot(mod, which = 1:6)

The first major thing that I see is that palm beach is a major outlier a probably the cause of these plots being interesting. Palm beach is way outside of the major group of points and this is causing the linear lines to be shifted and causing a less robust study to be done. When we use a double log on the linear model we notice that it corrects it a little bit but palm beach is still much higher than the other variables and is still resulting in it being an outlier.