Abigail Balint
April 11, 2023
---
title: "Homework 3"
author: "Abigail Balint"
description: "HW3 Responses"
date: "04/11/23"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
- hw3
- homework3
- abigailbalint
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(ggplot2)
library(dplyr)
library(readxl)
library(alr4)
library(smss)
knitr::opts_chunk$set(echo = TRUE)
```
# Question 1
a) The predictor is the ppgdp variable and the response is fertility.
b) Scatterplot below:
A straight-line function doesn't make sense here: at the low end of GDP the fertility rates span a huge range, so no single line describes the points well.
```{r}
ggplot(data = UN11, aes(x = ppgdp, y = fertility)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)
```
c) Scatterplot with log applied:
Taking the log of both variables straightens the relationship, so the points fall much closer to a straight line.
```{r}
ggplot(data = UN11, aes(x = log(ppgdp), y = log(fertility))) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)
```
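To illustrate why logging both variables can straighten a curved relationship, here is a minimal sketch on simulated data (not the UN11 data). The power-law form, the coefficients, and the noise level are all assumptions chosen for illustration, and comparing R-squared across the two scales is only a rough check because the responses differ.

```r
# Simulated data only: a power-law relationship with multiplicative noise.
set.seed(1)
x <- exp(runif(200, min = log(100), max = log(100000)))  # right-skewed predictor
y <- 50 * x^(-0.3) * exp(rnorm(200, sd = 0.2))           # y = a * x^b * noise

fit_raw <- lm(y ~ x)            # straight line on the original scale
fit_log <- lm(log(y) ~ log(x))  # straight line on the log-log scale

r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
slope_log <- unname(coef(fit_log)[2])  # estimates the exponent b

print(c(r2_raw = r2_raw, r2_log = r2_log, slope_log = slope_log))
```

On the log-log scale the slope estimates the exponent of the power law, which is why a curved cloud on the raw scale can become roughly linear after the transformation.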
# Question 2
a) The slope would change. Every value of the explanatory variable is rescaled by the same constant, so the fitted line passes through the same points, but the slope is now measured per new unit: dividing the explanatory variable by a constant c multiplies the slope by c. b) The correlation would not change, because correlation is unit-free and rescaling every value by the same positive constant leaves it the same.
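A claim about rescaling an explanatory variable can be checked numerically. The sketch below uses simulated data (not the homework data), and the conversion factor `k` is an arbitrary assumption. It compares the slope and the correlation before and after every `x` value is divided by `k`.

```r
# Simulated data only; k is a hypothetical unit-conversion factor.
set.seed(2)
k <- 1.33
x <- runif(100, min = 20, max = 120)
y <- 50 + 3 * x + rnorm(100, sd = 20)
x_rescaled <- x / k               # same quantity expressed in different units

slope_orig     <- unname(coef(lm(y ~ x))[2])
slope_rescaled <- unname(coef(lm(y ~ x_rescaled))[2])
cor_orig       <- cor(x, y)
cor_rescaled   <- cor(x_rescaled, y)

# The slope is multiplied by k, while the correlation is identical.
print(c(slope_orig = slope_orig, slope_rescaled = slope_rescaled))
print(c(cor_orig = cor_orig, cor_rescaled = cor_rescaled))
```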
# Question 3
I used the `pairs` function to generate a matrix of scatterplots for every combination of variables. The panels in the bottom right quadrant show the tightest linear patterns. This indicates that the stream runoff at BSAAM is most strongly correlated with the precipitation measurements at OPBPC, OPRC, and OPSLAKE.
```{r}
pairs(water)
```
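A scatterplot matrix is read by eye, so a numeric ranking can help confirm which variables track the response most closely. The helper below, `rank_by_correlation`, is a hypothetical name introduced for this sketch, and it is demonstrated on simulated data rather than on `water`.

```r
# Rank every other numeric column by the strength of its correlation
# with one response column (hypothetical helper, base R only).
rank_by_correlation <- function(df, response) {
  cors <- cor(df)[, response]                # correlations with the response
  cors <- cors[names(cors) != response]      # drop the response itself
  cors[order(abs(cors), decreasing = TRUE)]  # strongest first, sign kept
}

# Simulated demo: resp depends on a but not on b.
set.seed(3)
demo <- data.frame(a = rnorm(50), b = rnorm(50))
demo$resp <- 2 * demo$a + rnorm(50, sd = 0.5)
ranked <- rank_by_correlation(demo, "resp")
print(round(ranked, 2))

# With alr4 loaded, the same call would be:
# rank_by_correlation(water, "BSAAM")
```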
# Question 4
First I am showing the head of the data set to find the names of the five variables I need.
```{r}
head(Rateprof)
```
Then I am generating a scatterplot matrix of those variables using the `pairs` function.
Looking at the results, it's interesting to see that rating doesn't actually correlate that heavily with any of the four other variables. The only pairs that fall along a roughly straight line are quality and helpfulness (most correlated), clarity and quality, and helpfulness and clarity.
```{r}
pairs(~ quality + helpfulness + clarity + easiness + raterInterest, data=Rateprof)
```
# Question 5
For the religion vs political ideology chart, I can see that those who attend church never or occasionally are more likely to be on the liberal side of the spectrum, while the sample is more spread out for those who attend church often or every week. The association between little or no church attendance and being liberal looks stronger than the association between frequent attendance and being conservative.
```{r}
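# ?student.survey  # run interactively to read the variable descriptions
data("student.survey")
# Assumption: re and pi are ordered factors, so they are converted to their
# integer level codes to let one regression line span all categories
ggplot(data = student.survey, aes(x = as.numeric(re), y = as.numeric(pi))) +
  geom_point(position = "jitter") +
  geom_smooth(method = "lm") +
  labs(x = "Religiosity (re)", y = "Political ideology (pi)")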
```
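Fitting a line through categorical variables relies on treating their ordered levels as numbers. The sketch below shows what `as.numeric()` returns for an ordered factor. The level names and the four toy values are assumptions made up for illustration; they are not taken from `student.survey`.

```r
# Toy values only: an ordered factor with four assumed attendance levels.
attend <- factor(
  c("never", "occasionally", "every week", "never"),
  levels = c("never", "occasionally", "most weeks", "every week"),
  ordered = TRUE
)

codes <- as.numeric(attend)  # position of each value among the ordered levels
print(data.frame(attend, codes))
```

The codes keep the order of the levels but treat the gaps between neighboring levels as equal, which is a modeling choice worth keeping in mind when reading the fitted line.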
This graph shows a negative relationship: more hours of TV per week are associated with a lower high school GPA. Similarly to the previous graph, the points cluster most tightly near zero hours and a high GPA, and the spread widens as the hours of TV per week increase.
```{r}
data("student.survey")
ggplot(data = student.survey, aes(x = tv, y = hi)) +
  geom_point() +  # show the individual students so the spread is visible
  geom_smooth(method = "lm")
```