Homework 3 - Prahitha Movva

Categories: hw3, regression, correlation, covariance
Description: The third homework
Author: Prahitha Movva
Published: October 31, 2022

Code
library(readxl)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(dplyr)
library(stats)
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
Warning: package 'smss' was built under R version 4.2.2
Code
knitr::opts_chunk$set(echo=TRUE, warning=FALSE)

Question 1

1.1.1

The predictor is ppgdp and the response is fertility because we are studying the dependence of fertility on ppgdp.

1.1.2

Code
ggplot(data=UN11, aes(x=ppgdp, y=fertility))+geom_point()

There are very few observations with gross national product per person between 20,000 and 100,000 compared to those below 20,000, and fertility drops steeply at low ppgdp before flattening out. Since the graph does not exhibit a linear relationship, a straight-line mean function is not plausible.
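
As a quick visual check (a sketch; output not shown), we can overlay a least-squares line on the raw scatterplot to see how poorly a straight-line mean function tracks the data:

Code
ggplot(data=UN11, aes(x=ppgdp, y=fertility)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)  # the straight line misses the steep drop at low ppgdp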

1.1.3

Code
ggplot(data=UN11, aes(x=log(ppgdp), y=log(fertility)))+geom_point()

We can see that as log(ppgdp) increases, log(fertility) decreases roughly linearly. So the relationship between the two transformed variables is approximately linear, and a simple linear regression model is plausible.
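
A quick fit backs up the visual impression (a sketch; output not shown):

Code
fit.loglog <- lm(log(fertility) ~ log(ppgdp), data=UN11)
coef(fit.loglog)  # the slope should be negative: log(fertility) falls as log(ppgdp) rises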

Question 2

a

Because one British pound corresponds to 1.33 US dollars, converting ppgdp from dollars to pounds divides the predictor values by 1.33, so the slope gets multiplied by 1.33: the slope of the British prediction equation is about 33% larger in magnitude than that of the US equation, while the intercept (and the fitted values) are unchanged. We can also observe this in the graph below, where the British equation is shown in red and the US equation in blue.

Code
UN11$british.ppgdp <- UN11$ppgdp/1.33

ggplot() +
  geom_smooth(data=UN11, aes(x = british.ppgdp, y = fertility), 
              method = "lm", se = FALSE, color = "red") + 
  geom_smooth(data=UN11, aes(x = ppgdp, y = fertility), 
              method = "lm", se = FALSE, color = "blue") + 
  geom_point(data=UN11, aes(x = british.ppgdp, y = fertility), color = "red") + 
  geom_point(data=UN11, aes(x = ppgdp, y = fertility), color = "blue")
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
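
The scaling of the slope can also be checked numerically (a sketch; output not shown). Multiplying the US-dollar slope by 1.33 should reproduce the British-pound slope exactly:

Code
coef(lm(fertility ~ ppgdp, data=UN11))["ppgdp"] * 1.33
coef(lm(fertility ~ british.ppgdp, data=UN11))["british.ppgdp"]  # should match the value above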

b

The correlation between the explanatory variable and the response will not change, because correlation is invariant to a linear change of scale in either variable: dividing ppgdp by 1.33 rescales the predictor but leaves the correlation exactly as it was.

Code
UN11$british.ppgdp <- UN11$ppgdp/1.33

cor(UN11$british.ppgdp, UN11$fertility)
[1] -0.4399891
Code
cor(UN11$ppgdp, UN11$fertility)
[1] -0.4399891

Question 3

Code
data(water)
pairs(water)

From the matrix, we can see that the first three locations (APMAM, APSAB, and APSLAKE) are strongly correlated with each other. Similarly, the next three locations (OPBPC, OPRC, and OPSLAKE) are also strongly correlated with each other. However, the first three locations are only weakly correlated with the response, BSAAM, whereas the last three are strongly correlated with it. This implies that using one of the last three locations will give a better fit and better predictions for stream runoff volume. On closer inspection, OPSLAKE looks like the best choice among the last three for predicting BSAAM, which the adjusted R-squared values below confirm.

Code
summary(lm(BSAAM ~ OPBPC, data=water))$adj.r.squared
[1] 0.7792942
Code
summary(lm(BSAAM ~ OPRC, data=water))$adj.r.squared
[1] 0.8419507
Code
summary(lm(BSAAM ~ OPSLAKE, data=water))$adj.r.squared
[1] 0.8777515
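
The pairwise correlations behind this reading can also be computed directly (a sketch; output not shown). The first column of water is Year, which is dropped here:

Code
round(cor(water[, -1]), 2)  # correlations among the six precipitation sites and BSAAM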

Question 4

Code
data(Rateprof)
head(Rateprof)
  gender numYears numRaters numCourses pepper discipline              dept
1   male        7        11          5     no        Hum           English
2   male        6        11          5     no        Hum Religious Studies
3   male       10        43          2     no        Hum               Art
4   male       11        24          5     no        Hum           English
5   male       11        19          7     no        Hum           Spanish
6   male       10        15          9     no        Hum           Spanish
   quality helpfulness  clarity easiness raterInterest sdQuality sdHelpfulness
1 4.636364    4.636364 4.636364 4.818182      3.545455 0.5518564     0.6741999
2 4.318182    4.545455 4.090909 4.363636      4.000000 0.9020179     0.9341987
3 4.790698    4.720930 4.860465 4.604651      3.432432 0.4529343     0.6663898
4 4.250000    4.458333 4.041667 2.791667      3.181818 0.9325048     0.9315329
5 4.684211    4.684211 4.684211 4.473684      4.214286 0.6500112     0.8200699
6 4.233333    4.266667 4.200000 4.533333      3.916667 0.8632717     1.0327956
  sdClarity sdEasiness sdRaterInterest
1 0.5045250  0.4045199       1.1281521
2 0.9438798  0.5045250       1.0744356
3 0.4129681  0.5407021       1.2369438
4 0.9990938  0.5882300       1.3322506
5 0.5823927  0.6117753       0.9749613
6 0.7745967  0.6399405       0.6685579
Code
rates <- Rateprof %>%
  select(quality, helpfulness, clarity, easiness, raterInterest)
pairs(rates)

All of the pairs have positive correlations, but of different magnitudes. Quality, helpfulness, and clarity have strong, roughly linear positive relationships with one another. Quality's correlations with easiness and with rater interest are positive but weak (weaker for rater interest). The plot of easiness against rater interest looks the flattest of all, implying a very weak correlation at most.
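
As a numeric check on this reading of the scatterplot matrix (a sketch; output not shown), we can compute the pairwise correlations directly:

Code
round(cor(rates), 2)  # quality/helpfulness/clarity should be high; easiness vs. raterInterest low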

Question 5

Code
data(student.survey)
pi.clean <- as.numeric(student.survey$pi)  # political ideology as a 1-7 score (very liberal to very conservative)
re.clean <- as.numeric(student.survey$re)  # religiosity: frequency of religious attendance as a numeric score
model.i <- lm(pi.clean ~ re.clean, data=student.survey)
summary(model.i)

Call:
lm(formula = pi.clean ~ re.clean, data = student.survey)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81243 -0.87160  0.09882  1.12840  3.09882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.9308     0.4252   2.189   0.0327 *  
re.clean      0.9704     0.1792   5.416 1.22e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared:  0.3359,    Adjusted R-squared:  0.3244 
F-statistic: 29.34 on 1 and 58 DF,  p-value: 1.221e-06
Code
model.ii <- lm(hi ~ tv, data = student.survey)
summary(model.ii)

Call:
lm(formula = hi ~ tv, data = student.survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.441353   0.085345  40.323   <2e-16 ***
tv          -0.018305   0.008658  -2.114   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879

a

Code
ggplot(data=student.survey, aes(x=re, y=pi)) +
  geom_jitter(color="blue") +
  geom_abline(intercept=0.9308, slope=0.9704) +
  geom_smooth(method='lm')
`geom_smooth()` using formula 'y ~ x'

From the above visualization, we can also say that as religiosity increases, political ideology moves toward the conservative end of the scale (higher values of pi).

Code
ggplot(data=student.survey, aes(x=tv, y=hi)) +
  geom_point(color="blue") +
  geom_abline(intercept=3.441353, slope=-0.018305) +
  geom_smooth(method='lm')
`geom_smooth()` using formula 'y ~ x'

From the above visualization, we can say that as the number of hours of TV watching increases, the student's high school GPA tends to decrease slightly.

b

In (i), ~33.6% of the variance in political ideology is explained by religiosity, and a one-unit increase in religiosity is associated with an increase of about 0.97 units in political ideology (toward the conservative end). The p-value for (i) (1.22e-06) is far below the 0.05 significance threshold, so the relationship is statistically significant. In (ii), only ~7.2% of the variance in high school GPA is explained by hours of TV watching, and each additional hour of TV is associated with a GPA about 0.018 points lower. The p-value for (ii) (0.0388) is below the usual 0.05 threshold but close to it, so the evidence for this relationship is much weaker.
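
One way to see how borderline (ii) is (a sketch; output not shown): the 95% confidence interval for the tv slope should only barely exclude zero, while the interval for the re.clean slope in (i) sits far from zero:

Code
confint(model.i, "re.clean")
confint(model.ii, "tv")  # upper bound should be just below zero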