hw3
descriptive statistics
probability
Homework 3
Author

Caitlin Rowley

Published

April 11, 2023

Code
# load libraries
library(tidyr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(magrittr)

Attaching package: 'magrittr'
The following object is masked from 'package:tidyr':

    extract
Code
library(ggplot2)
library(markdown)
library(ggtext)
Warning: package 'ggtext' was built under R version 4.2.2
Code
library(readxl)
Warning: package 'readxl' was built under R version 4.2.2

Question 1

United Nations (Data file: UN11 in alr4) The data in the file UN11 contains several variables,
including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth
rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN
member countries, but also other areas such as Hong Kong that are not independent countries.
The data were collected from the United Nations (2011). We will study the dependence of
fertility on ppgdp.
(a) Identify the predictor and the response.
(b) Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis
and summarize the information in this graph. Does a straight-line mean function seem to
be plausible for a summary of this graph?
(c) Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does
the simple linear regression model seem plausible for a summary of this graph? If you use
a different base of logarithms, the shape of the graph won’t change, but the values on the
axes will change.

Code
# read in data:

UN11 <- readRDS(url('https://github.com/omerfyalcin/colab-data/blob/main/UN.rds?raw=true'))
head(UN11)
                  region  group fertility   ppgdp lifeExpF pctUrban
Afghanistan         Asia  other     5.968   499.0    49.49       23
Albania           Europe  other     1.525  3677.2    80.40       53
Algeria           Africa africa     2.142  4473.0    75.00       67
American Samoa      <NA>   <NA>        NA      NA       NA       NA
Angola            Africa africa     5.135  4321.9    53.17       59
Anguilla       Caribbean  other     2.000 13750.1    81.10      100
               infantMortality
Afghanistan          124.53500
Albania               16.56100
Algeria               21.45800
American Samoa        11.29389
Angola                96.19100
Anguilla                    NA

A)

Because we are studying how fertility depends on ppgdp, ppgdp (per-person GDP) is the predictor and fertility is the response.

B)

Code
# generate scatterplot:

scatter_1 <- ggplot(UN11, aes(x=ppgdp, y=fertility)) + 
  geom_point( color="black")
scatter_1
Warning: Removed 14 rows containing missing values (geom_point).

Code
# with linear trend
scatter_2 <- ggplot(UN11, aes(x=ppgdp, y=fertility)) +
  geom_point() +
  geom_smooth(method=lm, se=TRUE)
scatter_2
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 14 rows containing non-finite values (stat_smooth).
Removed 14 rows containing missing values (geom_point).

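The scatterplot shows that higher fertility is associated with lower per-person GDP. The relationship is strongly curved, though: fertility drops steeply across the low-ppgdp localities and then levels off at higher ppgdp. A straight-line mean function therefore does not seem plausible for this graph; the fitted line misses this curvature and runs well away from the points over much of the ppgdp range.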

C)

Code
# generate scatterplot using natural logarithms:

scatter_3 <- ggplot(UN11, aes(x=log(ppgdp), y=log(fertility))) +
  geom_point() +
  geom_smooth(method=lm, se=TRUE)
scatter_3
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 14 rows containing non-finite values (stat_smooth).
Warning: Removed 14 rows containing missing values (geom_point).

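After taking natural logarithms of both variables, the points scatter roughly evenly around a straight, downward-sloping line, so the simple linear regression model seems a much more plausible summary for this graph.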
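One way to see why a log-log plot can straighten a curved scatter: if the response follows a power law in the predictor with multiplicative noise, taking logs of both sides gives a linear relationship. The sketch below illustrates this on simulated data only. The values are not the UN11 data, and the power-law form, the exponent of -0.2, and all object names are assumptions chosen for illustration.

```r
# Simulated data only: these are NOT the UN11 values.
set.seed(1)
n <- 200

# Hypothetical predictor, spread evenly on the log scale from 100 to 100,000
x <- exp(runif(n, log(100), log(1e5)))

# Hypothetical power law y = 30 * x^(-0.2) with multiplicative noise
y <- 30 * x^(-0.2) * exp(rnorm(n, sd = 0.3))

fit_raw <- lm(y ~ x)            # straight line on the original scale
fit_log <- lm(log(y) ~ log(x))  # straight line on the log-log scale

r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
slope_log <- unname(coef(fit_log)["log(x)"])

# Compare the two fits; the log-log slope estimates the power-law exponent
print(c(r2_raw = r2_raw, r2_log = r2_log, slope_log = slope_log))
```

The same comparison could be run on the real data by replacing `x` and `y` with `ppgdp` and `fertility` after dropping rows with missing values.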

Question 2

Annual income, in dollars, is an explanatory variable in a regression analysis. For a British
version of the report on the analysis, all responses are converted to British pounds sterling (1 pound
equals about 1.33 dollars, as of 2016).
(a) How, if at all, does the slope of the prediction equation change?
(b) How, if at all, does the correlation change?

A)

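Converting income from dollars to pounds divides every value of the explanatory variable by 1.33, so the slope of the prediction equation is multiplied by 1.33: a one-pound increase in income is a 1.33-dollar increase, so the predicted response changes by 1.33 times as much per unit. The intercept does not change.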

B)

The correlation will not change, because it does not depend on the units of measurement.
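The effect of rescaling an explanatory variable can be checked numerically. The sketch below uses simulated data, assuming that the income values (the explanatory variable) are what get converted from dollars to pounds at 1.33 dollars per pound. The incomes, the response, and all object names are hypothetical.

```r
set.seed(42)

# Hypothetical incomes in dollars and a hypothetical response
income_usd <- runif(100, 20000, 120000)
outcome <- 5 + 0.0002 * income_usd + rnorm(100)

# Same incomes expressed in pounds (1 pound = 1.33 dollars)
income_gbp <- income_usd / 1.33

slope_usd <- unname(coef(lm(outcome ~ income_usd))[2])
slope_gbp <- unname(coef(lm(outcome ~ income_gbp))[2])

cor_usd <- cor(outcome, income_usd)
cor_gbp <- cor(outcome, income_gbp)

# The slope ratio equals 1.33 and the correlations agree,
# both up to floating-point error
print(slope_gbp / slope_usd)
print(all.equal(cor_usd, cor_gbp))
```

Dividing the predictor by a constant multiplies the slope by that constant, while the correlation is unaffected by a positive rescaling of either variable.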

Question 3

Water runoff in the Sierras (Data file: water in alr4) Can Southern California’s water
supply in future years be predicted from past data? One factor affecting water availability is stream
runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs
more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six
sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and
OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw
the scatterplot matrix for these data and summarize the information available from these
plots. (Hint: Use the pairs() function.)

Code
library(alr4)
Warning: package 'alr4' was built under R version 4.2.3
Loading required package: car
Warning: package 'car' was built under R version 4.2.3
Loading required package: carData
Warning: package 'carData' was built under R version 4.2.3

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
Loading required package: effects
Warning: package 'effects' was built under R version 4.2.3
lattice theme set by effectsTheme()
See ?effectsTheme for details.

Attaching package: 'alr4'
The following object is masked _by_ '.GlobalEnv':

    UN11
Code
head(water)
  Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594
Code
pairs(water)

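Based on the scatterplot matrix, stream runoff near Bishop, California (BSAAM) appears to be strongly and positively correlated with precipitation at the three “O” sites (OPBPC, OPRC, and OPSLAKE), but not with precipitation at the “A” sites (APMAM, APSAB, and APSLAKE) or with year.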
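The visual impression from the scatterplot matrix can be backed up with pairwise correlations. The following is a minimal sketch, assuming the alr4 package is installed; the object name `runoff_cor` is introduced here for illustration.

```r
# Runs only if alr4 is available, so the script does not fail without it
if (requireNamespace("alr4", quietly = TRUE)) {
  water <- alr4::water

  # Correlation of runoff (BSAAM) with year and each precipitation site
  runoff_cor <- cor(water)["BSAAM", setdiff(names(water), "BSAAM")]

  # Sort so the most strongly related sites appear first
  print(round(sort(runoff_cor, decreasing = TRUE), 2))
}
```

Correlations only capture linear association, so they complement the plots rather than replace them.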

Question 4

Professor ratings (Data file: Rateprof in alr4) In the website and online forum
RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site
includes millions of ratings on thousands of instructors. The data file includes the summaries of
the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011).
Each instructor included in the data had at least 10 ratings over a several year period. Students
provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and
raterInterest in the subject matter covered in the instructor’s courses. The data file provides the
averages of these five ratings. Create a scatterplot matrix of these five variables. Provide a
brief description of the relationships between the five ratings.

Code
head(Rateprof)
  gender numYears numRaters numCourses pepper discipline              dept
1   male        7        11          5     no        Hum           English
2   male        6        11          5     no        Hum Religious Studies
3   male       10        43          2     no        Hum               Art
4   male       11        24          5     no        Hum           English
5   male       11        19          7     no        Hum           Spanish
6   male       10        15          9     no        Hum           Spanish
   quality helpfulness  clarity easiness raterInterest sdQuality sdHelpfulness
1 4.636364    4.636364 4.636364 4.818182      3.545455 0.5518564     0.6741999
2 4.318182    4.545455 4.090909 4.363636      4.000000 0.9020179     0.9341987
3 4.790698    4.720930 4.860465 4.604651      3.432432 0.4529343     0.6663898
4 4.250000    4.458333 4.041667 2.791667      3.181818 0.9325048     0.9315329
5 4.684211    4.684211 4.684211 4.473684      4.214286 0.6500112     0.8200699
6 4.233333    4.266667 4.200000 4.533333      3.916667 0.8632717     1.0327956
  sdClarity sdEasiness sdRaterInterest
1 0.5045250  0.4045199       1.1281521
2 0.9438798  0.5045250       1.0744356
3 0.4129681  0.5407021       1.2369438
4 0.9990938  0.5882300       1.3322506
5 0.5823927  0.6117753       0.9749613
6 0.7745967  0.6399405       0.6685579
Code
pairs(Rateprof[,8:12])

Quality: Based on the scatterplot matrix, we can see that there is a positive correlation between quality and helpfulness, as well as between quality and clarity (though we see one outlier that pairs a comparatively high clarity rating with a lower quality rating). There is a slightly positive correlation between quality and easiness, but no definitive correlation between quality and rater interest.

Helpfulness: Based on the scatterplot matrix, we see (as mentioned) a positive correlation between helpfulness and quality, as well as between helpfulness and clarity (though the spread of data points is slightly larger). Similar to quality, there is a slightly positive correlation between helpfulness and easiness, but no definitive correlation between helpfulness and rater interest.

Clarity: Based on the scatterplot matrix, we see (as mentioned) a positive correlation between clarity and quality, as well as between clarity and helpfulness (though the spread of data points is slightly larger). Similar to quality and helpfulness, there is a slightly positive correlation between clarity and easiness, but no definitive correlation between clarity and rater interest.

Easiness: There are slightly positive correlations between easiness and quality, easiness and helpfulness, and easiness and clarity. There is no definitive correlation between easiness and rater interest.

Rater Interest: There are no definitive correlations between rater interest and any of the remaining four variables.

Overall, quality, helpfulness, and clarity are closely related to one another, while easiness is only weakly related to them and rater interest shows no clear relationship with any of the other ratings.
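The five ratings can also be selected by name rather than by column position, which guards against the column order changing, and their relationships can be summarized in a correlation matrix. This is a sketch assuming the alr4 package is installed; `rating_vars` and `rating_cor` are names introduced here.

```r
# Runs only if alr4 is available
if (requireNamespace("alr4", quietly = TRUE)) {
  # Select the five average ratings by name instead of by index
  rating_vars <- c("quality", "helpfulness", "clarity",
                   "easiness", "raterInterest")
  ratings <- alr4::Rateprof[, rating_vars]

  # Pairwise Pearson correlations between the five ratings
  rating_cor <- cor(ratings)
  print(round(rating_cor, 2))
}
```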

Question 5

For the student.survey data file in the smss package, conduct regression analyses relating
(by convention, y denotes the outcome variable, x denotes the explanatory variable)
(i) y = political ideology and x = religiosity,
(ii) y = high school GPA and x = hours of TV watching.
(You can use ?student.survey in the R console, after loading the package, to see what each variable
means.)
(a) Graphically portray how the explanatory variable relates to the outcome variable in
each of the two cases
(b) Summarize and interpret results of inferential analyses.

A)

Code
library(smss)
Warning: package 'smss' was built under R version 4.2.3
Code
data(student.survey)
head(student.survey)
  subj ge ag  hi  co   dh   dr tv sp ne ah    ve pa           pi           re
1    1  m 32 2.2 3.5    0  5.0  3  5  0  0 FALSE  r conservative   most weeks
2    2  f 23 2.1 3.5 1200  0.3 15  7  5  6 FALSE  d      liberal occasionally
3    3  f 27 3.3 3.0 1300  1.5  0  4  3  0 FALSE  d      liberal   most weeks
4    4  f 35 3.5 3.2 1500  8.0  5  5  6  3 FALSE  i     moderate occasionally
5    5  m 23 3.1 3.5 1600 10.0  6  6  3  0 FALSE  i very liberal        never
6    6  m 39 3.5 3.5  350  3.0  4  5  7  0 FALSE  d      liberal occasionally
     ab    aa    ld
1 FALSE FALSE FALSE
2 FALSE FALSE    NA
3 FALSE FALSE    NA
4 FALSE FALSE FALSE
5 FALSE FALSE FALSE
6 FALSE FALSE    NA
Code
library(dplyr)
survey <- student.survey %>%
  rename("poli_ideo"="pi", "relig"="re", "TV"="tv", "gpa_hs"="hi")

lm(poli_ideo ~ relig, data=survey)
Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
response will be ignored
Warning in Ops.ordered(y, z$residuals): '-' is not meaningful for ordered
factors

Call:
lm(formula = poli_ideo ~ relig, data = survey)

Coefficients:
(Intercept)      relig.L      relig.Q      relig.C  
     3.5253       2.1864       0.1049      -0.6958  
Code
ggplot(data = survey, aes(x = relig, fill = poli_ideo)) +
    geom_bar(position = "fill") + scale_fill_brewer(palette="PiYG")

Code
lm(gpa_hs ~ TV, data=survey)

Call:
lm(formula = gpa_hs ~ TV, data = survey)

Coefficients:
(Intercept)           TV  
    3.44135     -0.01831  
Code
ggplot(data = survey, aes(x=TV, y=gpa_hs)) +
    geom_point() + geom_smooth(method=lm, se=TRUE)
`geom_smooth()` using formula 'y ~ x'

B)

Code
# install stargazer once from the console if needed: install.packages("stargazer")
library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
Code
poli_relig <- lm(as.numeric(poli_ideo) ~ as.numeric(relig), 
         data = survey)

gpa_tv <- lm(gpa_hs ~ TV, data = survey)

stargazer(poli_relig, gpa_tv, type = 'text', dep.var.labels = c('Political Ideology', 'High School GPA'), covariate.labels = c('Religiosity', 'Hours of TV'))

================================================================
                                     Dependent variable:        
                              ----------------------------------
                              Political Ideology High School GPA
                                     (1)               (2)      
----------------------------------------------------------------
Religiosity                        0.970***                     
                                   (0.179)                      
                                                                
Hours of TV                                         -0.018**    
                                                     (0.009)    
                                                                
Constant                           0.931**          3.441***    
                                   (0.425)           (0.085)    
                                                                
----------------------------------------------------------------
Observations                          60               60       
R2                                  0.336             0.072     
Adjusted R2                         0.324             0.056     
Residual Std. Error (df = 58)       1.345             0.447     
F Statistic (df = 1; 58)          29.336***          4.471**    
================================================================
Note:                                *p<0.1; **p<0.05; ***p<0.01

Based on this inferential analysis, there is a statistically significant positive association between religiosity and political ideology (p-value < 0.01). Specifically, each one-unit increase in religiosity is associated with a 0.970-unit increase in political ideology, with both ordinal variables coded as numeric scores. The adjusted R-squared value is 0.324, so religiosity accounts for only about a third of the variation in political ideology, and a linear regression is not a very good fit for these data.

There is also a statistically significant negative association between hours of TV watched and high school GPA (p-value < 0.05). Specifically, each additional hour of TV watched per week is associated with a 0.018-point decline in GPA. However, the adjusted R-squared value is 0.056, so hours of TV explain very little of the variation in GPA, and this model is a poor fit for prediction.
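Because political ideology and religiosity are ordered categories, treating them as equally spaced numbers is an assumption. A rank-based check and a confidence interval for the TV slope are sketched below, assuming the smss package is installed. The Spearman test is a swapped-in alternative, not part of the original analysis, and the object names are introduced here.

```r
# Runs only if smss is available
if (requireNamespace("smss", quietly = TRUE)) {
  data("student.survey", package = "smss")

  # Ordered factors converted to their integer category codes
  ideology <- as.numeric(student.survey$pi)
  religiosity <- as.numeric(student.survey$re)

  # Spearman rank correlation; exact = FALSE because the data have many ties
  rank_test <- cor.test(ideology, religiosity,
                        method = "spearman", exact = FALSE)
  print(rank_test)

  # 95% confidence intervals for the GPA ~ TV regression coefficients
  gpa_tv_check <- lm(hi ~ tv, data = student.survey)
  print(confint(gpa_tv_check))
}
```

If the rank correlation agrees in sign and rough strength with the regression result, the conclusion does not hinge on the numeric coding of the categories.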