hw3
Ken Docekal
Author

Ken Docekal

Published

October 31, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE)

Q1

Loading in data

Code
library(alr4) 
Loading required package: car
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss) 
Warning: package 'smss' was built under R version 4.2.2
Code
data('UN11', package = 'alr4')
Code
head(UN11)
                region  group fertility   ppgdp lifeExpF pctUrban
Afghanistan       Asia  other     5.968   499.0    49.49       23
Albania         Europe  other     1.525  3677.2    80.40       53
Algeria         Africa africa     2.142  4473.0    75.00       67
Angola          Africa africa     5.135  4321.9    53.17       59
Anguilla     Caribbean  other     2.000 13750.1    81.10      100
Argentina   Latin Amer  other     2.172  9162.1    79.89       93

1.1.1

The predictor variable is ppgdp (Per Person Gross National Product) and the response variable is fertility (Birth Rate Per 1000 Females).

1.1.2

Plotting the scatterplot of fertility versus ppgdp shows a large initial reduction in fertility as GDP increase which quickly levels out with little subsequent change in fertility as GDP continues to increase.A straight-line mean function seems like an implausible fit.

Code
plot(x = UN11$ppgdp, y = UN11$fertility)

1.1.3

A simple liner regression model seems more plausible when using natural log, the plot now seems like a good fit for a negative linear regression line.

Code
plot(x = log(UN11$ppgdp), y = log(UN11$fertility))

Q2

a

Changing the unit of the explanatory variable from currency in dollars to pound, which is worth more than the equivalent amount in dollars in 2016, leads to a smaller coefficient for annual income and therefore decrease the slope in a linear regression.

b

Correlation, as a standardized version of the slope, does not rely on unit of measurement however and will not be affected by an change in currency denomination.

Q3

Loading in water data.

Code
data('water', package = 'alr4')

Looking at the scatter plot matrix shows relationships between variables for year, precipitation at six Sierra Nevada mountain sites - APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE, and stream runoff volume near Bishop CA - BSAAM.

First column shows precipitation then runoff (y-axis) by year(x-axis). We can see that for most sites precipitation has a somewhat wide distribution confined towards the lower end of the range with some outliners. OPRC precipitation is has greater spread while OPSLAKE and BSAAM show somewhat of a convex relationship.

Looking at the last row comparing runoff (y-axis) to precipitation (x-axis); there is minimal correlation between precipitation from APMAM, APSAB, and APSLAKE and runoff levels but a strong positive linear correlation with precipitation from OPBPC, OPRC, and OPSLAKE sites. OPBPC, OPRC, and OPSLAKE sites’ greater correlation with BSAAM implies that these sites may be closer or more influential to stream runoff volume near Bishop CA.

When focusing on the relationships between precipitation across sites there seems to be two groupings with high correlations; APMAM, APSAB, and APSLAKE all show fairly strong positive linear relationships with each other as do OPBPC, OPRC, and OPSLAKE. Across these groups of variables however the relationship is less clear and values are generally clustered among the lower values.This implies that sites based on these two groupings may share closer geographic proximity and are therefore more similarly affected by precipitation levels.

Code
pairs(water)

Q4

Loading in Rateprof data.

Code
data('Rateprof', package = 'alr4')

Specifying rating variables - quality, helpfulness, clarity, easiness, raterInterest.

Code
Rateprof1 = subset(Rateprof, select = c(quality, helpfulness, clarity, easiness, raterInterest))

head(Rateprof1)
   quality helpfulness  clarity easiness raterInterest
1 4.636364    4.636364 4.636364 4.818182      3.545455
2 4.318182    4.545455 4.090909 4.363636      4.000000
3 4.790698    4.720930 4.860465 4.604651      3.432432
4 4.250000    4.458333 4.041667 2.791667      3.181818
5 4.684211    4.684211 4.684211 4.473684      4.214286
6 4.233333    4.266667 4.200000 4.533333      3.916667

The scatter plot matrix shows very strong positive linear relationships between quality, helpfulness, and clarity. Easiness is also positively correlated with these three variables but less strongly. raterInterest is even less strongly correlated with quality, helpfulness, clarity, and easiness (especially with easiness) but there still seems to be a slightly positive linear relationship.

Code
pairs(Rateprof1)

Q5

Loading in student.survey data.

Code
data('student.survey', package = 'smss')

Reviewing variables; pi is political ideology, re is religiosity, hi is high school GPA, and tv is average hours of TV watching per week.

Code
?student.survey
starting httpd help server ... done
Code
view(student.survey)

i

Regression analysis of y = political ideology and x = religiosity.

Code
lm(pi ~ re, data = student.survey)
Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
response will be ignored
Warning in Ops.ordered(y, z$residuals): '-' is not meaningful for ordered
factors

Call:
lm(formula = pi ~ re, data = student.survey)

Coefficients:
(Intercept)         re.L         re.Q         re.C  
     3.5253       2.1864       0.1049      -0.6958  

ii

Regression analysis of y = high school GPA and x = hours of TV watching.

Code
lm(hi ~ tv, data = student.survey)

Call:
lm(formula = hi ~ tv, data = student.survey)

Coefficients:
(Intercept)           tv  
    3.44135     -0.01831  

a

Plot of y = political ideology and x = religiosity.

Code
ggplot(data = student.survey, aes(x = re, y = pi)) +
  geom_point() +
  geom_smooth(method = 'lm')
`geom_smooth()` using formula 'y ~ x'

Plot of y = high school GPA and x = hours of TV watching.

Code
ggplot(data = student.survey, aes(x = tv, y = hi)) +
  geom_point() +
  geom_smooth(method = 'lm')
`geom_smooth()` using formula 'y ~ x'

b

Analysis of the relationship between political ideology and religiosity shows a weak positive linear relationship between the two variables. As religious service attendance increases political conservatism also increases while greater political liberalism is associated with lower attendance.

The relationship between high school GPA and hours of TV watching shows a slightly negative linear relationship. Increased hours watching TV is correlated with a small decrease in high school GPA.