hw1
challenge1
my name
dataset
ggplot2
Author

Paritosh G

Published

May 28, 2023

Loding the libraries

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
Code
library(ggplot2)
library(smss)
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(magrittr)

Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract
Code
library(broom)

Question 1

Code
data("house.selling.price.2")

df <- house.selling.price.2
Code
house_price <- lm(P ~ S + Be + Ba + New, data = df)
summary(house_price)

Call:
lm(formula = P ~ S + Be + Ba + New, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16

a)

  • Beds would be deleted first as its p value of 0.4867 not statistically significant

b)

  • Size(s) has the smallest p value and hence in forward method when we start with no explanatory variable Size(s) should be added first.

c)

House with more bedrooms have most likely more number of bathrooms hence larger in size. If size, bedrooms, and bathrooms all are measuring price the high p value of bed is due to multi-collinearity.

d)

Code
fit_No_Bed <- lm(P ~ S + Ba + New, data = df)
fit_No_Bed_and_Bath <- lm(P ~ S + New, data = df)
fit_new <- lm(P ~ New, data = df)

R2

Code
summary(fit_No_Bed)$r.squared
[1] 0.8681361
Code
summary(fit_No_Bed_and_Bath)$r.squared
[1] 0.8483699
Code
summary(fit_new)$r.squared
[1] 0.1271307
  • fit_No_Bed model is the best with highest R2 and our best model.

Adjusted R2

Code
summary(fit_No_Bed)$adj.r.squared
[1] 0.8636912
Code
summary(fit_No_Bed_and_Bath)$adj.r.squared
[1] 0.8450003
Code
summary(fit_new)$adj.r.squared
[1] 0.1175388
  • fit_No_Bed model has the highest adjusted R squared and our best model.

Press

function for calculating press stat.

Code
press_stat <- function(model) {
  # Calculate PRESS residuals
  pr <- resid(model) / (1 - lm.influence(model)$hat)
  
  # Compute the PRESS statistic
  press <- sum(pr^2)
  
  return(press)
}

calculating Press statistic for all models

Code
press_stat(fit_No_Bed)
[1] 27860.05
Code
press_stat(fit_No_Bed_and_Bath)
[1] 31066
Code
press_stat(fit_new)
[1] 164039.3
  • fit_no_Bed_and_Bath has the lowest Press of 31066 and it is our best model.

AIC

Code
AIC(fit_No_Bed)
[1] 789.1366
Code
AIC(fit_No_Bed_and_Bath)
[1] 800.1262
Code
AIC(fit_new)
[1] 960.908
  • fit_No_bed has the lowest AIC score and is our best model.

BIC

Code
BIC(fit_No_Bed)
[1] 801.7996
Code
BIC(fit_No_Bed_and_Bath)
[1] 810.2566
Code
BIC(fit_new)
[1] 968.5058
  • fit_No_Bed has the lowest BIC score and is our best model

e)

hence model fit_no_Bed with variables size,bathroom and new is our best model.

Q.2)

Code
data(trees)

a)

Code
model <- lm(Volume ~ Girth + Height, data = trees)
summary(tree_model)
Error in summary(tree_model): object 'tree_model' not found

b)

Code
par(mfrow = c(2,3))
plot(model, which = 1:6)

Residuals vs Fitted

The curved shape of the line suggests this model violates our assumption of linearity that the relationship between the independent and dependent variables is linear.

Scale-Location

The curved shape of the line suggests this model violates our assumption of constant variance, or homoscedasticity, which affects the statistical significance of the model.

Cooks Distance, Residuals vs Leverage, Cooks dist vs Leverage

All three charts show there is single, potentially influence point, which can impact whether our model meets the assumptions of linear regression and can disproportionately affect our regression coefficients.

Q.3)

a)

Code
data("florida")
model_2 <- lm(Buchanan ~ Bush, data = florida)
par(mfrow = c(2,3))
plot(model_2, which = 1:6)

Yes, based on the diagnostic plots, Palm Beach County is an outlier.

On the Residuals vs. Fitted Values plot, the residuals largely “bounce randomly” around the zero line and form a roughly horizontal band around the zero line - with Dade and Palm Beach as significant outliers.

The Normal Q-Q plot follows a straight line and the points do not deviate significantly from the line - again, with two major exceptions: Dade, and Palm Beach. The Scale-Location plot is roughly horizonal, but again has a major exception: Palm Beach.

Finally, the Resisuals vs. Leverage plot shows outliers outside of Cook’s distance - Dade, and once again, Palm Beach.

b)

Code
LogBuchanan <- lm(log(Buchanan) ~ log(Bush), data = florida)
summary(LogBuchanan)

Call:
lm(formula = log(Buchanan) ~ log(Bush), data = florida)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96075 -0.25949  0.01282  0.23826  1.66564 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.57712    0.38919  -6.622 8.04e-09 ***
log(Bush)    0.75772    0.03936  19.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared:  0.8508,    Adjusted R-squared:  0.8485 
F-statistic: 370.6 on 1 and 65 DF,  p-value: < 2.2e-16
Code
# Produce the regression diagnostic plots
par(mfrow = c(2, 2))
plot(LogBuchanan)

The p-value has gotten smaller and the adjusted r squared has gone from 0.3795 to 0.8485, suggesting that the log-log model is a better-fitting model. When it comes to the diagnostic plots, however, Palm Beach is still a notable outlier on the Residuals vs. Fitted, Normal Q-Q, and Scale-Location plots. It is now within Cook’s distance on the Residuals vs. Leverage plot.