Homework_5

hw1

challenge1

my name

dataset

ggplot2

Author

Paritosh G

Published

May 28, 2023

Loding the libraries

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

Code

library(ggplot2)
library(smss)
library(alr4)

Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.

Code

library(magrittr)


Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract

Code

library(broom)

Question 1

Code

data("house.selling.price.2")

df <- house.selling.price.2

Code

house_price <- lm(P ~ S + Be + Ba + New, data = df)
summary(house_price)


Call:
lm(formula = P ~ S + Be + Ba + New, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16

a)

Beds would be deleted first as its p value of 0.4867 not statistically significant

b)

Size(s) has the smallest p value and hence in forward method when we start with no explanatory variable Size(s) should be added first.

c)

House with more bedrooms have most likely more number of bathrooms hence larger in size. If size, bedrooms, and bathrooms all are measuring price the high p value of bed is due to multi-collinearity.

d)

Code

fit_No_Bed <- lm(P ~ S + Ba + New, data = df)
fit_No_Bed_and_Bath <- lm(P ~ S + New, data = df)
fit_new <- lm(P ~ New, data = df)

Code

summary(fit_No_Bed)$r.squared

[1] 0.8681361

Code

summary(fit_No_Bed_and_Bath)$r.squared

[1] 0.8483699

Code

summary(fit_new)$r.squared

[1] 0.1271307

fit_No_Bed model is the best with highest R2 and our best model.

Adjusted R2

Code

summary(fit_No_Bed)$adj.r.squared

[1] 0.8636912

Code

summary(fit_No_Bed_and_Bath)$adj.r.squared

[1] 0.8450003

Code

summary(fit_new)$adj.r.squared

[1] 0.1175388

fit_No_Bed model has the highest adjusted R squared and our best model.

Press

function for calculating press stat.

Code

press_stat <- function(model) {
  # Calculate PRESS residuals
  pr <- resid(model) / (1 - lm.influence(model)$hat)
  
  # Compute the PRESS statistic
  press <- sum(pr^2)
  
  return(press)
}

calculating Press statistic for all models

Code

press_stat(fit_No_Bed)

[1] 27860.05

Code

press_stat(fit_No_Bed_and_Bath)

[1] 31066

Code

press_stat(fit_new)

[1] 164039.3

fit_no_Bed_and_Bath has the lowest Press of 31066 and it is our best model.

AIC

Code

AIC(fit_No_Bed)

[1] 789.1366

Code

AIC(fit_No_Bed_and_Bath)

[1] 800.1262

Code

AIC(fit_new)

[1] 960.908

fit_No_bed has the lowest AIC score and is our best model.

BIC

Code

BIC(fit_No_Bed)

[1] 801.7996

Code

BIC(fit_No_Bed_and_Bath)

[1] 810.2566

Code

BIC(fit_new)

[1] 968.5058

fit_No_Bed has the lowest BIC score and is our best model

e)

hence model fit_no_Bed with variables size,bathroom and new is our best model.

Q.2)

Code

data(trees)

a)

Code

model <- lm(Volume ~ Girth + Height, data = trees)
summary(tree_model)

Error in summary(tree_model): object 'tree_model' not found

b)

Code

par(mfrow = c(2,3))
plot(model, which = 1:6)

Residuals vs Fitted

The curved shape of the line suggests this model violates our assumption of linearity that the relationship between the independent and dependent variables is linear.

Scale-Location

The curved shape of the line suggests this model violates our assumption of constant variance, or homoscedasticity, which affects the statistical significance of the model.

Cooks Distance, Residuals vs Leverage, Cooks dist vs Leverage

All three charts show there is single, potentially influence point, which can impact whether our model meets the assumptions of linear regression and can disproportionately affect our regression coefficients.

Q.3)

a)

Code

data("florida")
model_2 <- lm(Buchanan ~ Bush, data = florida)
par(mfrow = c(2,3))
plot(model_2, which = 1:6)

Yes, based on the diagnostic plots, Palm Beach County is an outlier.

On the Residuals vs. Fitted Values plot, the residuals largely “bounce randomly” around the zero line and form a roughly horizontal band around the zero line - with Dade and Palm Beach as significant outliers.

The Normal Q-Q plot follows a straight line and the points do not deviate significantly from the line - again, with two major exceptions: Dade, and Palm Beach. The Scale-Location plot is roughly horizonal, but again has a major exception: Palm Beach.

Finally, the Resisuals vs. Leverage plot shows outliers outside of Cook’s distance - Dade, and once again, Palm Beach.

b)

Code

LogBuchanan <- lm(log(Buchanan) ~ log(Bush), data = florida)
summary(LogBuchanan)


Call:
lm(formula = log(Buchanan) ~ log(Bush), data = florida)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96075 -0.25949  0.01282  0.23826  1.66564 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.57712    0.38919  -6.622 8.04e-09 ***
log(Bush)    0.75772    0.03936  19.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared:  0.8508,    Adjusted R-squared:  0.8485 
F-statistic: 370.6 on 1 and 65 DF,  p-value: < 2.2e-16

Code

# Produce the regression diagnostic plots
par(mfrow = c(2, 2))
plot(LogBuchanan)

The p-value has gotten smaller and the adjusted r squared has gone from 0.3795 to 0.8485, suggesting that the log-log model is a better-fitting model. When it comes to the diagnostic plots, however, Palm Beach is still a notable outlier on the Residuals vs. Fitted, Normal Q-Q, and Scale-Location plots. It is now within Cook’s distance on the Residuals vs. Leverage plot.