Homework 3

hw3
Author

Saaradhaa M

Published

October 28, 2022

Qn 1.1.1

The predictor is ppgdp and the response is fertility.

Qn 1.1.2

# load libraries.
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
library(smss)

# load dataset.
data(UN11)

# draw scatterplot.
scatterplot(fertility ~ ppgdp, UN11)

No, a straight-line mean function does not seem plausible; the graph seems curvilinear.

Qn 1.1.3

# draw scatterplot.
scatterplot(log(fertility) ~ log(ppgdp), UN11)

Yes, the simple linear regression model now seems plausible.
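
Taking logs of both variables straightens any relationship of the power-law form y = a * x^b, because log(y) = log(a) + b * log(x). The sketch below illustrates this with simulated data rather than UN11; the values of a and b, the noise level and all variable names are assumptions chosen only for illustration.

```r
# Simulated (hypothetical) data, not UN11: y follows a power law in x.
set.seed(1)
x <- exp(runif(200, log(100), log(100000)))    # spans three orders of magnitude
y <- 8 * x^(-0.2) * exp(rnorm(200, sd = 0.1))  # multiplicative noise

# Fit a straight line on the raw scale and on the log-log scale.
fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ log(x))

# The log-log fit should explain more variance, and its slope
# should be close to the exponent used in the simulation (-0.2).
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
slope_log <- unname(coef(fit_log)[2])
print(c(raw = r2_raw, loglog = r2_log, slope = slope_log))
```

On the real data, the analogous fit would be lm(log(fertility) ~ log(ppgdp), UN11).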

Qn 2a

We can test this using the UN11 dataset since ppgdp is in US dollars.

# create new variable.
UN11$british <- 1.33*UN11$ppgdp

# check slope.
summary(lm(fertility ~ british, UN11))

Call:
lm(formula = fertility ~ british, data = UN11)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9006 -0.8801 -0.3547  0.6749  3.7585 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.178e+00  1.048e-01  30.331  < 2e-16 ***
british     -2.407e-05  3.500e-06  -6.877  7.9e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.206 on 197 degrees of freedom
Multiple R-squared:  0.1936,    Adjusted R-squared:  0.1895 
F-statistic: 47.29 on 1 and 197 DF,  p-value: 7.903e-11
summary(lm(fertility ~ ppgdp, UN11))

Call:
lm(formula = fertility ~ ppgdp, data = UN11)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9006 -0.8801 -0.3547  0.6749  3.7585 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.178e+00  1.048e-01  30.331  < 2e-16 ***
ppgdp       -3.201e-05  4.655e-06  -6.877  7.9e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.206 on 197 degrees of freedom
Multiple R-squared:  0.1936,    Adjusted R-squared:  0.1895 
F-statistic: 47.29 on 1 and 197 DF,  p-value: 7.903e-11

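Multiplying the predictor by 1.33 divides the slope by 1.33: it changes from -3.201e-05 to -2.407e-05 (-3.201e-05 / 1.33). The standard error is divided by the same factor, so the t value, the p-value, R^2 and adjusted R^2 are all unchanged, as is the intercept.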
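
The same behaviour can be checked on any dataset. The following is a minimal sketch using simulated data (the variables x and y and their generating values are hypothetical); it rescales the predictor by the same constant used above and compares the two fits.

```r
# Simulated (hypothetical) data standing in for income and an outcome.
set.seed(2)
x <- runif(100, 1000, 50000)
y <- 3 - 0.00003 * x + rnorm(100)

k <- 1.33                      # rescaling constant, as in the code above
fit_x  <- lm(y ~ x)
fit_kx <- lm(y ~ I(k * x))     # same model with the predictor multiplied by k

slope_x  <- unname(coef(fit_x)[2])
slope_kx <- unname(coef(fit_kx)[2])
r2_x  <- summary(fit_x)$r.squared
r2_kx <- summary(fit_kx)$r.squared

# The rescaled slope equals the original slope divided by k,
# while R-squared is identical in both fits.
print(c(slope_x = slope_x, slope_kx = slope_kx, ratio = slope_x / slope_kx))
print(c(r2_x = r2_x, r2_kx = r2_kx))
```

Had the predictor been divided by 1.33 instead, the slope would be multiplied by 1.33 by the same argument.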

Qn 2b

We can test this too.

# correlation with US dollars.
cor(UN11$ppgdp, UN11$fertility)
[1] -0.4399891
# correlation with British pounds.
cor(UN11$british, UN11$fertility)
[1] -0.4399891

Since we only multiplied the predictor by a positive constant, the correlation remains the same.
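
Correlation is unchanged by multiplying a variable by a positive constant, and only changes sign if the constant is negative. A small sketch with simulated data (x and y are hypothetical) shows both cases, along with the fact that R^2 in a simple linear regression is the squared correlation. This is consistent with the output above, where (-0.4399891)^2 is approximately 0.1936.

```r
# Simulated (hypothetical) data.
set.seed(3)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

r_orig <- cor(x, y)
r_pos  <- cor(1.33 * x, y)    # positive constant: correlation unchanged
r_neg  <- cor(-1.33 * x, y)   # negative constant: only the sign flips

# In simple linear regression, R-squared is the squared correlation.
r2 <- summary(lm(y ~ x))$r.squared
print(c(r_orig = r_orig, r_pos = r_pos, r_neg = r_neg, r2 = r2))
```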

Qn 3

# load dataset.
data(water)

# generate scatterplots.
pairs(water)

Stream runoff (BSAAM) seems to have a positive linear relationship with precipitation at OPSLAKE, OPRC and OPBPC, but not with precipitation at APMAM, APSAB or APSLAKE. Stream runoff also seems to show no clear trend over the years.
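
A correlation matrix is one way to put numbers on what a scatterplot matrix suggests. The sketch below uses a small simulated data frame (the columns site_a, site_b and runoff are hypothetical stand-ins, not the water variables) in which one site is related to runoff and the other is not.

```r
# Simulated (hypothetical) data, not the water dataset.
set.seed(4)
n <- 40
site_a <- rnorm(n, 10, 3)                     # constructed to be related to runoff
site_b <- rnorm(n, 5, 2)                      # constructed to be unrelated to runoff
runoff <- 20 + 4 * site_a + rnorm(n, sd = 3)
toy <- data.frame(site_a, site_b, runoff)

# pairs(toy) would draw the scatterplot matrix; cor() gives the
# matching matrix of pairwise Pearson correlations.
r <- cor(toy)
print(round(r, 2))
```

On the real data, cor(water) should give the corresponding matrix, assuming every column of water is numeric.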

Qn 4

# load dataset.
data(Rateprof)

# create subset.
rateprof <- Rateprof %>% select(quality, helpfulness, clarity, easiness, raterInterest)

# generate scatterplots.
pairs(rateprof)

Quality, helpfulness and clarity have the clearest linear relationships with one another. Easiness and raterInterest do not seem to have linear relationships with the other variables.

Qn 5a

# load dataset.
data(student.survey)
glimpse(student.survey)
Rows: 60
Columns: 18
$ subj <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19…
$ ge   <fct> m, f, f, f, m, m, m, f, m, m, m, f, m, m, f, f, f, m, m, f, f, f,…
$ ag   <int> 32, 23, 27, 35, 23, 39, 24, 31, 34, 28, 23, 27, 36, 28, 28, 25, 4…
$ hi   <dbl> 2.2, 2.1, 3.3, 3.5, 3.1, 3.5, 3.6, 3.0, 3.0, 4.0, 2.3, 3.5, 3.3, …
$ co   <dbl> 3.5, 3.5, 3.0, 3.2, 3.5, 3.5, 3.7, 3.0, 3.0, 3.1, 2.6, 3.6, 3.5, …
$ dh   <int> 0, 1200, 1300, 1500, 1600, 350, 0, 5000, 5000, 900, 253, 190, 245…
$ dr   <dbl> 5.0, 0.3, 1.5, 8.0, 10.0, 3.0, 0.2, 1.5, 2.0, 2.0, 1.5, 3.0, 1.5,…
$ tv   <dbl> 3, 15, 0, 5, 6, 4, 5, 5, 7, 1, 10, 14, 6, 3, 4, 7, 6, 5, 6, 25, 4…
$ sp   <int> 5, 7, 4, 5, 6, 5, 12, 3, 5, 1, 15, 3, 15, 10, 3, 6, 7, 9, 12, 0, …
$ ne   <int> 0, 5, 3, 6, 3, 7, 4, 3, 3, 2, 1, 7, 12, 1, 1, 1, 3, 6, 2, 0, 4, 7…
$ ah   <int> 0, 6, 0, 3, 0, 0, 2, 1, 0, 1, 1, 0, 5, 2, 0, 0, 10, 10, 2, 2, 1, …
$ ve   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ pa   <fct> r, d, d, i, i, d, i, i, i, i, r, d, d, i, d, i, i, d, i, d, i, i,…
$ pi   <ord> conservative, liberal, liberal, moderate, very liberal, liberal, …
$ re   <ord> most weeks, occasionally, most weeks, occasionally, never, occasi…
$ ab   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ aa   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ ld   <lgl> FALSE, NA, NA, FALSE, FALSE, NA, FALSE, FALSE, NA, FALSE, FALSE, …
# generate plots.
boxplot(pi ~ re, student.survey)

scatterplot(hi ~ tv, student.survey)

  • Religiosity and conservatism seem to have a positive relationship.
  • High school GPA and TV-watching seem to have a negative relationship.

Qn 5b

# change pi to numeric variable.
student.survey$pi <- as.numeric(student.survey$pi)

# remove ordering in re and shorten its level names.
levels(student.survey$re) <- c("N", "O", "M", "E")
student.survey$re <- factor(student.survey$re, ordered = FALSE)

# run regression models.
summary(lm(pi ~ re, student.survey))

Call:
lm(formula = pi ~ re, data = student.survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8889 -0.5172 -0.2667  1.2040  2.7333 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.2667     0.3394   6.678 1.18e-08 ***
reO           0.2506     0.4181   0.599 0.551374    
reM           2.1619     0.6017   3.593 0.000691 ***
reE           2.6222     0.5543   4.731 1.56e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.315 on 56 degrees of freedom
Multiple R-squared:  0.3872,    Adjusted R-squared:  0.3544 
F-statistic:  11.8 on 3 and 56 DF,  p-value: 4.282e-06
summary(lm(hi ~ tv, student.survey))

Call:
lm(formula = hi ~ tv, data = student.survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.441353   0.085345  40.323   <2e-16 ***
tv          -0.018305   0.008658  -2.114   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879
  • Those who attended religious services most weeks or every week were significantly more conservative than those who never did, p < .001. There was no significant difference in political ideology between those who occasionally attended religious services and those who never did.
  • Watching fewer hours of TV per week was associated with a higher high school GPA, p < .05. That said, as the R^2 is low (0.07), hours of TV watched is not a strong predictor of high school GPA.
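
With an unordered factor as the predictor, lm() uses treatment (dummy) coding by default: the intercept is the mean of the reference level and each other coefficient is that level's mean minus the reference mean. In the output above, the estimated mean ideology score for the "E" group is therefore 2.2667 + 2.6222 = 4.8889. The sketch below checks this on simulated data; the group sizes, group means and variable names are hypothetical.

```r
# Simulated (hypothetical) data: four groups with different means.
set.seed(5)
grp <- factor(rep(c("N", "O", "M", "E"), each = 10),
              levels = c("N", "O", "M", "E"))   # "N" is the reference level
true_means <- c(N = 2, O = 2.5, M = 4, E = 5)
score <- unname(true_means[as.character(grp)]) + rnorm(length(grp))

fit <- lm(score ~ grp)
group_means <- sapply(split(score, grp), mean)

# Intercept = mean of the reference level ("N");
# each other coefficient = that level's mean minus the reference mean.
coefs <- unname(coef(fit))
expected <- unname(c(group_means["N"],
                     group_means[c("O", "M", "E")] - group_means[["N"]]))
print(rbind(coefs, expected))
```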