Code
library(tidyverse)
library(readxl)
library(ggplot2)
library(dplyr)
library(tidyr)
library(alr4)
library(smss)
::opts_chunk$set(echo = TRUE) knitr
Liam Tucksmith
April 11, 2023
Predictor: ppgpd Response: Fertility
No, a straight-line mean function doesn’t seem plausible for this graph as ppgdp isn’t evenly distributed across the x-asix, as seen in the scatter plot. As fewer data are found further along the x-axis, a straight-line mean function wouldn’t accurately summarize the data.
region group fertility ppgdp lifeExpF pctUrban
Afghanistan Asia other 5.968 499.0 49.49 23
Albania Europe other 1.525 3677.2 80.40 53
Algeria Africa africa 2.142 4473.0 75.00 67
Angola Africa africa 5.135 4321.9 53.17 59
Anguilla Caribbean other 2.000 13750.1 81.10 100
Argentina Latin Amer other 2.172 9162.1 79.89 93
Yes, a simple linear regression model seems plausible for this graph as the data is structured in linear fashion, showing that as fertility goes up, ppgpd goes down.
region group fertility ppgdp lifeExpF pctUrban
Afghanistan Asia other 5.968 499.0 49.49 23
Albania Europe other 1.525 3677.2 80.40 53
Algeria Africa africa 2.142 4473.0 75.00 67
Angola Africa africa 5.135 4321.9 53.17 59
Anguilla Caribbean other 2.000 13750.1 81.10 100
Argentina Latin Amer other 2.172 9162.1 79.89 93
The slope of the prediction equation would change as the currency conversion would change the range of income on the axis, which in turn the shape of the data and the regression line.
The correlation of the prediction equation wouldn’t change as the currency conversion is a linear change, not affecting the magnitude of a correlation between the variables.
From the below scatter plot matrix, the measurements for the sites don’t change drastically as the years age, but many do take a similar decline in the more recent years, while the ones who don’t decline in recent years have instead slightly inclined. Particularly, the APMAM, APSAB, APSLAKE, all trend upwards from ~1980 on, while OPBPC, OPRC, OPSLAKE, and BSAAM all trend downward in the same time period. We can also see from the matrix that the sites measurement comparisons within these two groups are clustered together and aligned seemingly linear, while sites comparisons across the two groups do not have a visable relationship, linear or otherwise.
Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
1 1948 9.13 3.58 3.91 4.10 7.43 6.47 54235
2 1949 5.28 4.82 5.20 7.55 11.11 10.26 67567
3 1950 4.20 3.77 3.67 9.52 12.20 11.35 66161
4 1951 4.60 4.46 3.93 11.14 15.15 11.13 68094
5 1952 7.15 4.99 4.88 16.34 20.05 22.81 107080
6 1953 9.70 5.65 4.91 8.88 8.15 7.41 67594
Of the five variables, easiness and raterInterst appear to be the least correlated with the other three variables.We can see this as in the scatterplot matrix, the easiness and raterInterest plots are the least linear. On the other hand, the quality, helfulness, and clarity ratings appear to be strongly correlated as their scatterplots form a clearly linear shape.
gender numYears numRaters numCourses pepper discipline dept
1 male 7 11 5 no Hum English
2 male 6 11 5 no Hum Religious Studies
3 male 10 43 2 no Hum Art
4 male 11 24 5 no Hum English
5 male 11 19 7 no Hum Spanish
6 male 10 15 9 no Hum Spanish
quality helpfulness clarity easiness raterInterest sdQuality sdHelpfulness
1 4.636364 4.636364 4.636364 4.818182 3.545455 0.5518564 0.6741999
2 4.318182 4.545455 4.090909 4.363636 4.000000 0.9020179 0.9341987
3 4.790698 4.720930 4.860465 4.604651 3.432432 0.4529343 0.6663898
4 4.250000 4.458333 4.041667 2.791667 3.181818 0.9325048 0.9315329
5 4.684211 4.684211 4.684211 4.473684 4.214286 0.6500112 0.8200699
6 4.233333 4.266667 4.200000 4.533333 3.916667 0.8632717 1.0327956
sdClarity sdEasiness sdRaterInterest
1 0.5045250 0.4045199 1.1281521
2 0.9438798 0.5045250 1.0744356
3 0.4129681 0.5407021 1.2369438
4 0.9990938 0.5882300 1.3322506
5 0.5823927 0.6117753 0.9749613
6 0.7745967 0.6399405 0.6685579
Graphically portray how the explanatory variable relates to the outcome variable in each of the two cases
Summarize and interpret results of inferential analyses.
In the first grap, we are looking at a relationship between a categorical predictor and response variable. I used geom_count so that overlapping data points wouldn’t affect the appearance of the relationship. From the graph, we can see that liberal and very liberal repsondents attend church the least frequently. Those who attended religious services most frequently have a range of political beliefs, a higher concentration of politically like-minded folks attending church more often isn’t seen in this dataset.
Looking at the other graph, I used geom_smooth as this graph looks at the relationship between two continuous variables. From the graph, there is not a strong linear pattern, indicating that the relationship between these two variables isn’t linear. However, we can see that the most tv was watched by respondents with a GPA under 3.0, and respondents with a GPA above 3.0 watched the least tv.
---
title: "HW3"
author: "Liam Tucksmith"
desription: "HW3"
date: "04/11/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw3
- Liam Tucksmith
- tidyverse
- readxl
- ggplot2
- dplyr
- tidyr
editor_options:
markdown:
wrap: 72
---
```{r, echo=T}
#| label: setup
#| warning: false
library(tidyverse)
library(readxl)
library(ggplot2)
library(dplyr)
library(tidyr)
library(alr4)
library(smss)
knitr::opts_chunk$set(echo = TRUE)
```
1. United Nations (Data file: UN11in alr4) The data in the file UN11 contains several variables,
including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth
rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN
member countries, but also other areas such as Hong Kong that are not independent countries.
The data were collected from the United Nations (2011). We will study the dependence of
fertility on ppgdp.
(a) Identify the predictor and the response.
Predictor: ppgpd
Response: Fertility
(b) Draw the scatter plot of fertility on the vertical axis versus ppgdp on the horizontal axis
and summarize the information in this graph. Does a straight-line mean function seem to
be plausible for a summary of this graph?
No, a straight-line mean function doesn't seem plausible for this graph as ppgdp isn't evenly distributed across the x-asix, as seen in the scatter plot. As fewer data are found further along the x-axis, a straight-line mean function wouldn't accurately summarize the data.
```{r}
un <- data.frame(UN11)
head(un)
ggplot(un, aes(x=ppgdp, y=fertility)) + geom_point()
```
(c) Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does
the simple linear regression model seem plausible for a summary of this graph? If you use
a different base of logarithms, the shape of the graph won’t change, but the values on the
axes will change.
Yes, a simple linear regression model seems plausible for this graph as the data is structured in linear fashion, showing that as fertility goes up, ppgpd goes down.
```{r}
un <- data.frame(UN11)
head(un)
ggplot(un, aes(x=log(ppgdp), y=log(fertility))) + geom_point()
```
2. Annual income, in dollars, is an explanatory variable in a regression analysis. For a British
version of the report on the analysis, all responses are converted to British pounds sterling (1 pound
equals about 1.33 dollars, as of 2016).
(a) How, if at all, does the slope of the prediction equation change?
The slope of the prediction equation would change as the currency conversion would change the range of income on the axis, which in turn the shape of the data and the regression line.
(b) How, if at all, does the correlation change?
The correlation of the prediction equation wouldn't change as the currency conversion is a linear change, not affecting the magnitude of a correlation between the variables.
3. Water runoff in the Sierras (Data file: water in alr4) Can Southern California’s water
supply in future years be predicted from past data? One factor affecting water availability is stream
runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs
more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six
sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and
OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw
the scatterplot matrix for these data and summarize the information available from these
plots. (Hint: Use the pairs() function.)
From the below scatter plot matrix, the measurements for the sites don't change drastically as the years age, but many do take a similar decline in the more recent years, while the ones who don't decline in recent years have instead slightly inclined. Particularly, the APMAM, APSAB, APSLAKE, all trend upwards from ~1980 on, while OPBPC, OPRC, OPSLAKE, and BSAAM all trend downward in the same time period. We can also see from the matrix that the sites measurement comparisons within these two groups are clustered together and aligned seemingly linear, while sites comparisons across the two groups do not have a visable relationship, linear or otherwise.
```{r}
water <- data.frame(water)
head(water)
pairs(water)
```
4. Professor ratings (Data file: Rateprof in alr4) In the website and online forum
RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site
includes millions of ratings on thousands of instructors. The data file includes the summaries of
the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011).
Each instructor included in the data had at least 10 ratings over a several year period. Students
provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and
raterInterest in the subject matter covered in the instructor’s courses. The data file provides the
averages of these five ratings. Create a scatterplot matrix of these five variables. Provide a
brief description of the relationships between the five ratings.
Of the five variables, easiness and raterInterst appear to be the least correlated with the other three variables.We can see this as in the scatterplot matrix, the easiness and raterInterest plots are the least linear. On the other hand, the quality, helfulness, and clarity ratings appear to be strongly correlated as their scatterplots form a clearly linear shape.
```{r}
rate <- data.frame(Rateprof)
head(rate)
ratings <- select(rate, c('quality','helpfulness','clarity', 'easiness', 'raterInterest'))
pairs(ratings)
```
5. For the student.survey data file in the smss package, conduct regression analyses relating
(by convention, y denotes the outcome variable, x denotes the explanatory variable)
(i) y = political ideology and x = religiosity,
(ii) y = high school GPA and x = hours of TV watching.
(You can use ?student.survey in the R console, after loading the package, to see what each variable
means.)
(a) Graphically portray how the explanatory variable relates to the outcome variable in
each of the two cases
(b) Summarize and interpret results of inferential analyses.
In the first grap, we are looking at a relationship between a categorical predictor and response variable. I used geom_count so that overlapping data points wouldn't affect the appearance of the relationship. From the graph, we can see that liberal and very liberal repsondents attend church the least frequently. Those who attended religious services most frequently have a range of political beliefs, a higher concentration of politically like-minded folks attending church more often isn't seen in this dataset.
Looking at the other graph, I used geom_smooth as this graph looks at the relationship between two continuous variables. From the graph, there is not a strong linear pattern, indicating that the relationship between these two variables isn't linear. However, we can see that the most tv was watched by respondents with a GPA under 3.0, and respondents with a GPA above 3.0 watched the least tv.
```{r}
ggplot(student.survey, aes(re, pi)) + geom_count()
ggplot(student.survey, aes(hi, tv)) + geom_smooth()
```