Homework 3

Author

Nick Boonstra

Published

November 4, 2022

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(alr4)

Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.

library(smss)

Warning: package 'smss' was built under R version 4.2.2

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Question 1

un11 <- tibble(UN11)

head(un11,20)

# A tibble: 20 × 6
   region     group  fertility  ppgdp lifeExpF pctUrban
   <fct>      <fct>      <dbl>  <dbl>    <dbl>    <dbl>
 1 Asia       other       5.97   499      49.5       23
 2 Europe     other       1.52  3677.     80.4       53
 3 Africa     africa      2.14  4473      75         67
 4 Africa     africa      5.14  4322.     53.2       59
 5 Caribbean  other       2    13750.     81.1      100
 6 Latin Amer other       2.17  9162.     79.9       93
 7 Asia       other       1.74  3031.     77.3       64
 8 Caribbean  other       1.67 22852.     77.8       47
 9 Oceania    oecd        1.95 57119.     84.3       89
10 Europe     oecd        1.35 45159.     83.6       68
11 Asia       other       2.15  5638.     73.7       52
12 Caribbean  other       1.88 22462.     78.8       84
13 Asia       other       2.43 18184.     76.1       89
14 Asia       other       2.16   670.     70.2       29
15 Caribbean  other       1.58 14497.     80.3       45
16 Europe     other       1.48  5702      76.4       75
17 Europe     oecd        1.84 43815.     82.8       97
18 Latin Amer other       2.68  4496.     77.8       53
19 Africa     africa      5.08   741.     58.7       42
20 Caribbean  other       1.76 92625.     82.3      100

summary(un11)

        region      group       fertility         ppgdp         
 Africa    :53   oecd  : 31   Min.   :1.134   Min.   :   114.8  
 Asia      :50   other :115   1st Qu.:1.754   1st Qu.:  1283.0  
 Europe    :39   africa: 53   Median :2.262   Median :  4684.5  
 Latin Amer:20                Mean   :2.761   Mean   : 13012.0  
 Caribbean :17                3rd Qu.:3.545   3rd Qu.: 15520.5  
 Oceania   :17                Max.   :6.925   Max.   :105095.4  
 (Other)   : 3                                                  
    lifeExpF        pctUrban     
 Min.   :48.11   Min.   : 11.00  
 1st Qu.:65.66   1st Qu.: 39.00  
 Median :75.89   Median : 59.00  
 Mean   :72.29   Mean   : 57.93  
 3rd Qu.:79.58   3rd Qu.: 75.00  
 Max.   :87.12   Max.   :100.00

This question involves using United Nations data to “study the dependence of fertility on ppgdp” (the dependence of births per 1000 females on GDP in U.S. dollars per person).

1.1.1

In this model, ppgdp would be the predictor variable, and fertility would be the response variable.

1.1.2

un11 %>% 
  ggplot(aes(x=ppgdp,y=fertility)) +
  geom_point() +
  geom_smooth(method=lm,se=F)

A linear OLS regression of a linear-linear model of the data does not seem to fit the data very well. While the regression does capture the general downward trend of the data, it is plain to see in the above visualization that it is not a good fit, particularly for larger values of the response variable (ppgdp).

1.1.3

un11 %>% 
  ggplot(aes(x=log(ppgdp),y=log(fertility))) +
  geom_point() +
  geom_smooth(method=lm,se=F)

This log-log model of the data is much better suited to a linear OLS regression fit.

Question 2

q2 <- data.frame(
  life_expec = c(77, 80, 67, 59, 91),
  usd = c(70000, 91000, 65000, 45000, 122000)
)

q2 <- q2 %>% 
  mutate(gbp = usd * 1.33)

q2

  life_expec    usd    gbp
1         77  70000  93100
2         80  91000 121030
3         67  65000  86450
4         59  45000  59850
5         91 122000 162260

q2 %>% 
  ggplot(aes(x=usd,y=life_expec)) +
  geom_point() +
  geom_smooth(method=lm,se=F)

q2 %>% 
  ggplot(aes(x=gbp,y=life_expec)) +
  geom_point() +
  geom_smooth(method=lm,se=F)

summary(lm(life_expec ~ usd, data = q2))


Call:
lm(formula = life_expec ~ usd, data = q2)

Residuals:
      1       2       3       4       5 
 5.6790  0.1838 -2.2984 -2.2078 -1.3567 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 4.300e+01  5.438e+00   7.908  0.00422 **
usd         4.045e-04  6.565e-05   6.162  0.00860 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.842 on 3 degrees of freedom
Multiple R-squared:  0.9268,    Adjusted R-squared:  0.9024 
F-statistic: 37.97 on 1 and 3 DF,  p-value: 0.008601

summary(lm(life_expec ~ gbp, data = q2))


Call:
lm(formula = life_expec ~ gbp, data = q2)

Residuals:
      1       2       3       4       5 
 5.6790  0.1838 -2.2984 -2.2078 -1.3567 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 4.300e+01  5.438e+00   7.908  0.00422 **
gbp         3.042e-04  4.936e-05   6.162  0.00860 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.842 on 3 degrees of freedom
Multiple R-squared:  0.9268,    Adjusted R-squared:  0.9024 
F-statistic: 37.97 on 1 and 3 DF,  p-value: 0.008601

As can be seen by this example data, transforming a dataset between USD and GBP (that is, adding a linear coefficient to a variable) does change the slope coefficient of the prediction equation (corresponding to the ratio of the value of the two currencies), but it does not alter the overall correlation between the two variables.

Question 3

water_data <- tibble(water)

head(water_data,10)

# A tibble: 10 × 8
    Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
   <int> <dbl> <dbl>   <dbl> <dbl> <dbl>   <dbl>  <int>
 1  1948  9.13  3.58    3.91  4.1   7.43    6.47  54235
 2  1949  5.28  4.82    5.2   7.55 11.1    10.3   67567
 3  1950  4.2   3.77    3.67  9.52 12.2    11.4   66161
 4  1951  4.6   4.46    3.93 11.1  15.2    11.1   68094
 5  1952  7.15  4.99    4.88 16.3  20.0    22.8  107080
 6  1953  9.7   5.65    4.91  8.88  8.15    7.41  67594
 7  1954  5.02  1.45    1.77 13.6  12.4    13.3   65356
 8  1955  6.7   7.44    6.51  9.28  9.65    9.8   67909
 9  1956 10.5   5.85    3.38 21.2  18.6    17.4   92715
10  1957  9.1   6.13    4.08  9.55  9.2     8.25  70024

pairs(water_data[,2:8])

The correlation between stream runoff near Bishop, CA and rainfall at these six sites in the Sierra Nevada mountains can be interpreted from the scatterplots along the bottom row or the right-most column in the scatterplot matrix (they are both the same information). From this scatterplot matrix, it would appear that correlation between Bishop, CA runoff and Sierra Nevada rainfall is strongest for the OPSLAKE, OPRC, and OPBPC variables. In these plots, the data most clearly follow a linear trend, and the data are mostly staying close to a fit line.

Question 4

rateprof <- tibble(Rateprof)
head(rateprof,10)

# A tibble: 10 × 17
   gender numYears numRat…¹ numCo…² pepper disci…³ dept  quality helpf…⁴ clarity
   <fct>     <int>    <int>   <int> <fct>  <fct>   <fct>   <dbl>   <dbl>   <dbl>
 1 male          7       11       5 no     Hum     Engl…    4.64    4.64    4.64
 2 male          6       11       5 no     Hum     Reli…    4.32    4.55    4.09
 3 male         10       43       2 no     Hum     Art      4.79    4.72    4.86
 4 male         11       24       5 no     Hum     Engl…    4.25    4.46    4.04
 5 male         11       19       7 no     Hum     Span…    4.68    4.68    4.68
 6 male         10       15       9 no     Hum     Span…    4.23    4.27    4.2 
 7 male          7       17       3 no     Hum     Span…    4.38    4.35    4.41
 8 male         11       16       3 no     Hum     Engl…    2.06    2.06    2.06
 9 male         11       12       4 no     Hum     Music    2.04    2.17    2   
10 male          7       18       4 no     Hum     Engl…    4.11    4.22    4   
# … with 7 more variables: easiness <dbl>, raterInterest <dbl>,
#   sdQuality <dbl>, sdHelpfulness <dbl>, sdClarity <dbl>, sdEasiness <dbl>,
#   sdRaterInterest <dbl>, and abbreviated variable names ¹numRaters,
#   ²numCourses, ³discipline, ⁴helpfulness

pairs(rateprof[,8:12])

There is a strong positive correlation between quality, helpfulness, and clarity; besides that, the trend seems to be upward for all other pairs, but the correlation is not as strong.

Question 5

data(student.survey)

summary(student.survey)

      subj       ge           ag              hi              co       
 Min.   : 1.00   f:31   Min.   :22.00   Min.   :2.000   Min.   :2.600  
 1st Qu.:15.75   m:29   1st Qu.:24.00   1st Qu.:3.000   1st Qu.:3.175  
 Median :30.50          Median :26.50   Median :3.350   Median :3.500  
 Mean   :30.50          Mean   :29.17   Mean   :3.308   Mean   :3.453  
 3rd Qu.:45.25          3rd Qu.:31.00   3rd Qu.:3.625   3rd Qu.:3.725  
 Max.   :60.00          Max.   :71.00   Max.   :4.000   Max.   :4.000  
                                                                       
       dh             dr               tv               sp        
 Min.   :   0   Min.   : 0.200   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 205   1st Qu.: 1.450   1st Qu.: 3.000   1st Qu.: 3.000  
 Median : 640   Median : 2.000   Median : 6.000   Median : 5.000  
 Mean   :1232   Mean   : 3.818   Mean   : 7.267   Mean   : 5.483  
 3rd Qu.:1350   3rd Qu.: 5.000   3rd Qu.:10.000   3rd Qu.: 7.000  
 Max.   :8000   Max.   :20.000   Max.   :37.000   Max.   :16.000  
                                                                  
       ne               ah             ve          pa    
 Min.   : 0.000   Min.   : 0.000   Mode :logical   d:21  
 1st Qu.: 2.000   1st Qu.: 0.000   FALSE:60        i:24  
 Median : 3.000   Median : 0.500                   r:15  
 Mean   : 4.083   Mean   : 1.433                         
 3rd Qu.: 5.250   3rd Qu.: 2.000                         
 Max.   :14.000   Max.   :11.000                         
                                                         
                     pi                re         ab              aa         
 very liberal         : 8   never       :15   Mode :logical   Mode :logical  
 liberal              :24   occasionally:29   FALSE:60        FALSE:59       
 slightly liberal     : 6   most weeks  : 7                   NA's :1        
 moderate             :10   every week  : 9                                  
 slightly conservative: 6                                                    
 conservative         : 4                                                    
 very conservative    : 2                                                    
     ld         
 Mode :logical  
 FALSE:44       
 NA's :16

Some of the variables in question for the analysis will need to be recoded to ordinal variables before a proper regression analysis can be performed.

student_survey <- student.survey %>% 
  mutate(pi_ord = case_when(
    pi == "very liberal" ~ -3,
    pi == "liberal" ~ -2,
    pi == "slightly liberal" ~ -1,
    pi == "moderate" ~ 0,
    pi == "slightly conservative" ~ 1,
    pi == "conservative" ~ 2,
    pi == "very conservative" ~ 3
  )) %>% 
  mutate(re_ord = case_when(
    re == "never" ~ 0,
    re == "occasionally" ~ 1,
    re == "most weeks" ~ 2,
    re == "every week" ~ 3
  ))

summary(lm(pi_ord ~ re_ord, data = student_survey))


Call:
lm(formula = pi_ord ~ re_ord, data = student_survey)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81243 -0.87160  0.09882  1.12840  3.09882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -2.0988     0.2717  -7.724 1.78e-10 ***
re_ord        0.9704     0.1792   5.416 1.22e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared:  0.3359,    Adjusted R-squared:  0.3244 
F-statistic: 29.34 on 1 and 58 DF,  p-value: 1.221e-06

student_survey %>% 
  ggplot(aes(x=re_ord,y=pi_ord)) +
  geom_point(alpha=0.12) +
  geom_smooth(method=(lm))

A one-point ordinal increase in church attendance, based upon this response scale, corresponds to nearly a one-point ordinal increase in conservative ideology, with zero church attendance (“Never”) corresponding to a “Liberal” ideology, all to a strong level of statistical significance.

summary(lm(hi ~ tv, data = student_survey))


Call:
lm(formula = hi ~ tv, data = student_survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.441353   0.085345  40.323   <2e-16 ***
tv          -0.018305   0.008658  -2.114   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879

student_survey %>% 
  ggplot(aes(x=tv,y=hi)) +
  geom_point(alpha=0.5) +
  geom_smooth(method=(lm))

Each additional hour of TV watched corresponds to a roughly 0.02 point decrease in a student’s high school GPA. However, there are a few outliers, particularly along the predictor variable, that may skew this result.