Homework 3 - Prahitha Movva

Categories: hw3, regression, correlation, covariance
Description: The third homework
Author: Prahitha Movva
Published: October 31, 2022

Code
library(readxl)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(dplyr)
library(stats)
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
Warning: package 'smss' was built under R version 4.2.2
Code
knitr::opts_chunk$set(echo=TRUE, warning=FALSE)

Question 1

1.1.1

The predictor is ppgdp and the response is fertility because we are studying the dependence of fertility on ppgdp.

1.1.2

Code
ggplot(data=UN11, aes(x=ppgdp, y=fertility))+geom_point()

There are very few observations with gross national product per person between 20,000 and 100,000 compared to those below 20,000, and fertility drops steeply at low ppgdp before flattening out. Since the graph does not exhibit a linear relationship, a straight-line mean function is not plausible.
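
As a quick visual check (a sketch; output not shown), we can overlay a least-squares line on the raw scatterplot to see how poorly a straight-line mean function tracks the data:

Code
ggplot(data=UN11, aes(x=ppgdp, y=fertility)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)  # the straight line misses the steep drop at low ppgdp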

1.1.3

Code
ggplot(data=UN11, aes(x=log(ppgdp), y=log(fertility)))+geom_point()

We can see that as log(ppgdp) increases, log(fertility) decreases roughly linearly. So the relationship between the two transformed variables is approximately linear, and a simple linear regression model is plausible.
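
A quick fit backs up the visual impression (a sketch; output not shown):

Code
fit.loglog <- lm(log(fertility) ~ log(ppgdp), data=UN11)
coef(fit.loglog)  # the slope should be negative: log(fertility) falls as log(ppgdp) rises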

Question 2

a

Because one British pound corresponds to 1.33 US dollars, converting ppgdp from dollars to pounds divides the predictor values by 1.33, so the slope gets multiplied by 1.33: the slope of the British prediction equation is about 33% larger in magnitude than that of the US equation, while the intercept (and the fitted values) are unchanged. We can also observe this in the graph below, where the British equation is shown in red and the US equation in blue.

Code
UN11$british.ppgdp <- UN11$ppgdp/1.33

ggplot() +
  geom_smooth(data=UN11, aes(x = british.ppgdp, y = fertility), 
              method = "lm", se = FALSE, color = "red") + 
  geom_smooth(data=UN11, aes(x = ppgdp, y = fertility), 
              method = "lm", se = FALSE, color = "blue") + 
  geom_point(data=UN11, aes(x = british.ppgdp, y = fertility), color = "red") + 
  geom_point(data=UN11, aes(x = ppgdp, y = fertility), color = "blue")
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
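
The scaling of the slope can also be checked numerically (a sketch; output not shown). Multiplying the US-dollar slope by 1.33 should reproduce the British-pound slope exactly:

Code
coef(lm(fertility ~ ppgdp, data=UN11))["ppgdp"] * 1.33
coef(lm(fertility ~ british.ppgdp, data=UN11))["british.ppgdp"]  # should match the value above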

b

The correlation between the explanatory variable and the response will not change, because correlation is invariant to a linear change of scale in either variable: dividing ppgdp by 1.33 rescales the predictor but leaves the correlation exactly as it was.

Code
UN11$british.ppgdp <- UN11$ppgdp/1.33

cor(UN11$british.ppgdp, UN11$fertility)
[1] -0.4399891
Code
cor(UN11$ppgdp, UN11$fertility)
[1] -0.4399891

Question 3

Code
data(water)
pairs(water)

From the matrix, we can see that the first three locations (APMAM, APSAB, and APSLAKE) are strongly correlated with each other. Similarly, the next three locations (OPBPC, OPRC, and OPSLAKE) are also strongly correlated with each other. However, the first three locations are only weakly correlated with the response, BSAAM, whereas the last three are strongly correlated with it. This implies that using one of the last three locations will give a better fit and better predictions for stream runoff volume. On closer inspection, OPSLAKE looks like the best choice among the last three for predicting BSAAM, which the adjusted R-squared values below confirm.

Code
summary(lm(BSAAM ~ OPBPC, data=water))$adj.r.squared
[1] 0.7792942
Code
summary(lm(BSAAM ~ OPRC, data=water))$adj.r.squared
[1] 0.8419507
Code
summary(lm(BSAAM ~ OPSLAKE, data=water))$adj.r.squared
[1] 0.8777515
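
The pairwise correlations behind this reading can also be computed directly (a sketch; output not shown). The first column of water is Year, which is dropped here:

Code
round(cor(water[, -1]), 2)  # correlations among the six precipitation sites and BSAAM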

Question 4

Code
data(Rateprof)
head(Rateprof)
  gender numYears numRaters numCourses pepper discipline              dept
1   male        7        11          5     no        Hum           English
2   male        6        11          5     no        Hum Religious Studies
3   male       10        43          2     no        Hum               Art
4   male       11        24          5     no        Hum           English
5   male       11        19          7     no        Hum           Spanish
6   male       10        15          9     no        Hum           Spanish
   quality helpfulness  clarity easiness raterInterest sdQuality sdHelpfulness
1 4.636364    4.636364 4.636364 4.818182      3.545455 0.5518564     0.6741999
2 4.318182    4.545455 4.090909 4.363636      4.000000 0.9020179     0.9341987
3 4.790698    4.720930 4.860465 4.604651      3.432432 0.4529343     0.6663898
4 4.250000    4.458333 4.041667 2.791667      3.181818 0.9325048     0.9315329
5 4.684211    4.684211 4.684211 4.473684      4.214286 0.6500112     0.8200699
6 4.233333    4.266667 4.200000 4.533333      3.916667 0.8632717     1.0327956
  sdClarity sdEasiness sdRaterInterest
1 0.5045250  0.4045199       1.1281521
2 0.9438798  0.5045250       1.0744356
3 0.4129681  0.5407021       1.2369438
4 0.9990938  0.5882300       1.3322506
5 0.5823927  0.6117753       0.9749613
6 0.7745967  0.6399405       0.6685579
Code
rates <- Rateprof %>%
  select(quality, helpfulness, clarity, easiness, raterInterest)
pairs(rates)

All of the pairs have positive correlations, but of different magnitudes. Quality, helpfulness, and clarity have strong, roughly linear positive relationships with one another. Quality's correlations with easiness and with rater interest are positive but weak (weaker for rater interest). The plot of easiness against rater interest looks the flattest of all, implying a very weak correlation at most.
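
As a numeric check on this reading of the scatterplot matrix (a sketch; output not shown), we can compute the pairwise correlations directly:

Code
round(cor(rates), 2)  # quality/helpfulness/clarity should be high; easiness vs. raterInterest low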

Question 5

Code
data(student.survey)
pi.clean <- as.numeric(student.survey$pi)  # political ideology as a 1-7 score (very liberal to very conservative)
re.clean <- as.numeric(student.survey$re)  # religiosity: frequency of religious attendance as a numeric score
model.i <- lm(pi.clean ~ re.clean, data=student.survey)
summary(model.i)

Call:
lm(formula = pi.clean ~ re.clean, data = student.survey)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81243 -0.87160  0.09882  1.12840  3.09882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.9308     0.4252   2.189   0.0327 *  
re.clean      0.9704     0.1792   5.416 1.22e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared:  0.3359,    Adjusted R-squared:  0.3244 
F-statistic: 29.34 on 1 and 58 DF,  p-value: 1.221e-06
Code
model.ii <- lm(hi ~ tv, data = student.survey)
summary(model.ii)

Call:
lm(formula = hi ~ tv, data = student.survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.441353   0.085345  40.323   <2e-16 ***
tv          -0.018305   0.008658  -2.114   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879

a

Code
ggplot(data=student.survey, aes(x=re, y=pi)) +
  geom_jitter(color="blue") +
  geom_abline(intercept=0.9308, slope=0.9704) +
  geom_smooth(method='lm')
`geom_smooth()` using formula 'y ~ x'

From the above visualization, we can also say that as religiosity increases, political ideology moves toward the conservative end of the scale (higher values of pi).

Code
ggplot(data=student.survey, aes(x=tv, y=hi)) +
  geom_point(color="blue") +
  geom_abline(intercept=3.441353, slope=-0.018305) +
  geom_smooth(method='lm')
`geom_smooth()` using formula 'y ~ x'

From the above visualization, we can say that as the number of hours of TV watching increases, the student's high school GPA tends to decrease slightly.

b

In (i), ~33.6% of the variance in political ideology is explained by religiosity, and a one-unit increase in religiosity is associated with an increase of about 0.97 units in political ideology (toward the conservative end). The p-value for (i) (1.22e-06) is far below the 0.05 significance threshold, so the relationship is statistically significant. In (ii), only ~7.2% of the variance in high school GPA is explained by hours of TV watching, and each additional hour of TV is associated with a GPA about 0.018 points lower. The p-value for (ii) (0.0388) is below the usual 0.05 threshold but close to it, so the evidence for this relationship is much weaker.
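
One way to see how borderline (ii) is (a sketch; output not shown): the 95% confidence interval for the tv slope should only barely exclude zero, while the interval for the re.clean slope in (i) sits far from zero:

Code
confint(model.i, "re.clean")
confint(model.ii, "tv")  # upper bound should be just below zero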