hw3
homework3
abigailbalint
Author

Abigail Balint

Published

April 11, 2023

Code
library(tidyverse)
library(ggplot2)
library(dplyr)
library(readxl)
library(alr4)
library(smss)
knitr::opts_chunk$set(echo = TRUE)

Question 1

  1. The predictor is the ppgdp variable and the response is fertility.
  2. Scatterplot below: A straight-line mean function doesn’t make sense here, because at the low end of ppgdp the fertility rates span a huge range and the points curve away from the fitted line.
Code
ggplot(data = UN11, aes(x = ppgdp, y = fertility)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)
`geom_smooth()` using formula 'y ~ x'

  3. Scatterplot with log applied to both variables:

Taking the log of both variables pulls the points much closer to a straight line, so a simple linear regression model looks far more plausible on the log scale.

Code
ggplot(data = UN11, aes(x = log(ppgdp), y = log(fertility))) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)
`geom_smooth()` using formula 'y ~ x'
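
To put a rough number on the visual impression, one option is to fit both versions with lm() and compare how much variance each explains. This is only a sketch and not part of the original answer: the object names fit_raw, fit_log, r2_raw, and r2_log are made up for illustration, and it assumes UN11 is available from alr4 as loaded at the top.

```r
library(alr4)  # provides the UN11 data used above

# Straight-line fit on the original scale
fit_raw <- lm(fertility ~ ppgdp, data = UN11)

# Same model after taking logs of both variables
fit_log <- lm(log(fertility) ~ log(ppgdp), data = UN11)

# Share of variance explained by each fit
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
c(raw = r2_raw, log_log = r2_log)
```

Because the response is on a different scale in the two fits, the two R-squared values are only a rough guide; the scatterplots remain the main evidence.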

Question 2

  1. The slope does change, but only by the conversion factor. Every value of the explanatory variable is rescaled by the same constant, so if each value is divided by a constant c, the slope is multiplied by c. The fitted values and the strength of the relationship stay the same; only the units of the slope differ.
  2. The correlation would not change, because it is unit-free and the rescaling is applied equally across all values.
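
A small simulation is one way to check how a constant rescaling of the explanatory variable affects the slope and the correlation. The sketch below uses made-up data: x, y, the factor k, and all object names are hypothetical and do not come from the original question.

```r
set.seed(42)

# Simulated data; x, y, and the factor k are hypothetical
x <- runif(200, min = 10, max = 100)
y <- 3 + 0.5 * x + rnorm(200, sd = 5)
k <- 1.33                 # every x value is divided by the same constant
x_rescaled <- x / k

# Fit the same model before and after rescaling x
slope_orig <- coef(lm(y ~ x))[["x"]]
slope_resc <- coef(lm(y ~ x_rescaled))[["x_rescaled"]]

# Dividing x by k multiplies the slope by k ...
slope_check <- isTRUE(all.equal(slope_resc, slope_orig * k))

# ... while the correlation is identical before and after
cor_check <- isTRUE(all.equal(cor(x, y), cor(x_rescaled, y)))

c(slope_orig = slope_orig, slope_resc = slope_resc)
c(slope_check = slope_check, cor_check = cor_check)
```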

Question 3

I used the pairs function to generate a matrix of scatterplots for every pair of variables. The panels in the bottom-right block show the strongest linear relationships: the stream runoff at BSAAM is most strongly correlated with the precipitation measurements at OPBPC, OPRC, and OPSLAKE.

Code
pairs(water)
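
The visual reading of the scatterplot matrix can be backed up with the numeric correlations. This is a minimal sketch, assuming the water data from alr4 is loaded as above; cor_mat and bsaam_cor are illustrative names.

```r
library(alr4)  # provides the water data

# Pairwise Pearson correlations for all columns
cor_mat <- cor(water)

# Correlation of every other variable with BSAAM, largest first
others <- setdiff(rownames(cor_mat), "BSAAM")
bsaam_cor <- sort(cor_mat[others, "BSAAM"], decreasing = TRUE)
round(bsaam_cor, 2)
```

The variables at the top of the sorted vector are the ones most strongly related to runoff, which should line up with the tightest panels in the pairs plot.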

Question 4

First I am showing the head of the data set to see what the five variable names are.

Code
head(Rateprof)
  gender numYears numRaters numCourses pepper discipline              dept
1   male        7        11          5     no        Hum           English
2   male        6        11          5     no        Hum Religious Studies
3   male       10        43          2     no        Hum               Art
4   male       11        24          5     no        Hum           English
5   male       11        19          7     no        Hum           Spanish
6   male       10        15          9     no        Hum           Spanish
   quality helpfulness  clarity easiness raterInterest sdQuality sdHelpfulness
1 4.636364    4.636364 4.636364 4.818182      3.545455 0.5518564     0.6741999
2 4.318182    4.545455 4.090909 4.363636      4.000000 0.9020179     0.9341987
3 4.790698    4.720930 4.860465 4.604651      3.432432 0.4529343     0.6663898
4 4.250000    4.458333 4.041667 2.791667      3.181818 0.9325048     0.9315329
5 4.684211    4.684211 4.684211 4.473684      4.214286 0.6500112     0.8200699
6 4.233333    4.266667 4.200000 4.533333      3.916667 0.8632717     1.0327956
  sdClarity sdEasiness sdRaterInterest
1 0.5045250  0.4045199       1.1281521
2 0.9438798  0.5045250       1.0744356
3 0.4129681  0.5407021       1.2369438
4 0.9990938  0.5882300       1.3322506
5 0.5823927  0.6117753       0.9749613
6 0.7745967  0.6399405       0.6685579

Then I am generating a scatterplot matrix of those variables using the pairs function.

Looking at the results, it’s interesting to see that rating doesn’t correlate that heavily with any of the four other variables. The only pairs that fall along a roughly straight line are quality and helpfulness (most correlated), clarity and quality, and helpfulness and clarity.

Code
pairs(~ quality + helpfulness + clarity + easiness + raterInterest, data=Rateprof)
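
To check which pairs are most strongly related, the same five variables can be passed to cor(). This is a sketch under the assumption that Rateprof comes from alr4 as above; rating_vars and rating_cor are made-up names.

```r
library(alr4)  # provides the Rateprof data

# The five rating variables shown in the scatterplot matrix
rating_vars <- c("quality", "helpfulness", "clarity", "easiness", "raterInterest")

# Pairwise Pearson correlations, ignoring any missing values pair by pair
rating_cor <- cor(Rateprof[, rating_vars], use = "pairwise.complete.obs")
round(rating_cor, 2)
```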

Question 5

For the religiosity vs. political ideology chart, I can see that those who attend church never or occasionally are more likely to be on the liberal side of the spectrum, while those who attend often or every week are more spread out. The link between little or no church attendance and being liberal looks stronger than the link between frequent attendance and being conservative.

Code
?student.survey
data("student.survey")
ggplot(data = student.survey, aes(x = re, y = pi)) +
  geom_point(position = "jitter")+
  geom_smooth(method = "lm") 
`geom_smooth()` using formula 'y ~ x'
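
Because both variables are categories rather than measurements, a table of counts and a rank correlation may be easier to read than a jittered scatterplot. The sketch below assumes re and pi are stored as ordered factors in student.survey, so their integer level codes follow the intended order; re_pi_table and rho are illustrative names.

```r
library(smss)
data("student.survey")

# Counts for each combination of religiosity (re) and political ideology (pi)
re_pi_table <- xtabs(~ re + pi, data = student.survey)
re_pi_table

# Spearman rank correlation on the integer level codes
# (assumes both columns are ordered factors)
rho <- cor(as.numeric(student.survey$re),
           as.numeric(student.survey$pi),
           method = "spearman",
           use = "complete.obs")
rho
```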

This graph shows a negative relationship, with more hours of TV going along with a lower high school GPA. As in the previous graph, the points cluster most tightly at the low end (few hours of TV and a high GPA), and the spread widens as the hours of TV per week increase.

Code
data("student.survey")
ggplot(data = student.survey, aes(x = tv, y = hi)) +
  geom_point() +  # show the individual students, not just the fitted line
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'
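
The strength of this relationship can be checked directly with the correlation and the fitted slope. This is a minimal sketch, assuming tv and hi are numeric columns in student.survey; tv_hi_cor and tv_hi_fit are made-up names.

```r
library(smss)
data("student.survey")

# Pearson correlation between weekly TV hours and high school GPA
tv_hi_cor <- cor(student.survey$tv, student.survey$hi)
tv_hi_cor

# Simple linear regression of GPA on TV hours
tv_hi_fit <- lm(hi ~ tv, data = student.survey)
coef(tv_hi_fit)
```

A negative correlation and a negative slope on tv would agree with the downward line in the plot; the size of the correlation shows how strong the relationship actually is.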