The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Code
library(magrittr)
Attaching package: 'magrittr'
The following object is masked from 'package:tidyr':
extract
Code
library(ggplot2)
library(markdown)
library(ggtext)
Warning: package 'ggtext' was built under R version 4.2.2
Code
library(readxl)
Warning: package 'readxl' was built under R version 4.2.2
Question 1
United Nations (Data file: UN11 in alr4) The data in the file UN11 contains several variables,
including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth
rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN
member countries, but also other areas such as Hong Kong that are not independent countries.
The data were collected from the United Nations (2011). We will study the dependence of
fertility on ppgdp.
(a) Identify the predictor and the response.
(b) Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis
and summarize the information in this graph. Does a straight-line mean function seem to
be plausible for a summary of this graph?
(c) Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does
the simple linear regression model seem plausible for a summary of this graph? If you use
a different base of logarithms, the shape of the graph won’t change, but the values on the
axes will change.
Code
# read in data:
UN11 <- readRDS(url('https://github.com/omerfyalcin/colab-data/blob/main/UN.rds?raw=true'))
head(UN11)
region group fertility ppgdp lifeExpF pctUrban
Afghanistan Asia other 5.968 499.0 49.49 23
Albania Europe other 1.525 3677.2 80.40 53
Algeria Africa africa 2.142 4473.0 75.00 67
American Samoa <NA> <NA> NA NA NA NA
Angola Africa africa 5.135 4321.9 53.17 59
Anguilla Caribbean other 2.000 13750.1 81.10 100
infantMortality
Afghanistan 124.53500
Albania 16.56100
Algeria 21.45800
American Samoa 11.29389
Angola 96.19100
Anguilla NA
A)
In studying the dependence of fertility on ppgdp, ppgdp (per-person GDP) is the predictor and fertility is the response.
B)
The scatterplot shows that higher fertility is associated with lower per-person GDP. A straight-line mean function does not seem plausible for this graph: the relationship is strongly curved rather than linear, and a fitted straight line extends far beyond the range of the data points.
C)
After taking natural logarithms of both variables, the points fall much closer to a straight line, so the simple linear regression model seems a much more plausible summary of this graph.
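The question's remark that a different logarithm base changes the axis values but not the shape of the graph can be checked numerically. The sketch below uses simulated stand-in data, not UN11; the variable names and the coefficients 2.7 and -0.2 are illustrative assumptions only. It compares a log-log fit in natural logarithms with one in base-10 logarithms.

```r
set.seed(1)

# Simulated stand-in data (NOT the UN11 data): a noisy power-law relationship
gdp  <- exp(runif(200, log(100), log(100000)))
fert <- exp(2.7 - 0.2 * log(gdp) + rnorm(200, sd = 0.3))

# Fit the same log-log regression in two different bases
fit_ln  <- lm(log(fert) ~ log(gdp))      # natural logarithms
fit_l10 <- lm(log10(fert) ~ log10(gdp))  # base-10 logarithms

# The correlation and the slope are identical in both bases
cor_ln    <- cor(log(gdp), log(fert))
cor_l10   <- cor(log10(gdp), log10(fert))
slope_ln  <- unname(coef(fit_ln)[2])
slope_l10 <- unname(coef(fit_l10)[2])

# Only the intercept is rescaled, by a factor of 1 / log(10)
int_ln  <- unname(coef(fit_ln)[1])
int_l10 <- unname(coef(fit_l10)[1])

print(c(cor_ln = cor_ln, cor_l10 = cor_l10))
print(c(slope_ln = slope_ln, slope_l10 = slope_l10))
print(c(int_ln_rescaled = int_ln / log(10), int_l10 = int_l10))
```

Because log10(x) = log(x) / log(10), changing the base rescales both axes by the same constant, which leaves the correlation and the log-log slope untouched.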
Question 2
Annual income, in dollars, is an explanatory variable in a regression analysis. For a British
version of the report on the analysis, all responses are converted to British pounds sterling (1 pound
equals about 1.33 dollars, as of 2016).
(a) How, if at all, does the slope of the prediction equation change?
(b) How, if at all, does the correlation change?
A)
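Because income is the explanatory variable, converting it from dollars to pounds divides every x value by 1.33. The slope of the prediction equation is therefore multiplied by 1.33: the predicted change in y per pound is 1.33 times the predicted change per dollar. The intercept does not change.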
B)
The correlation will not change, because correlation does not depend on the units of measurement.
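Both answers can be checked with a small simulation. This is a minimal sketch on made-up data: the outcome `y`, the income range, and the coefficients are assumptions chosen only for illustration, and the 1.33 rate is taken from the question.

```r
set.seed(42)

# Hypothetical data: annual income in dollars and some numeric outcome y
income_usd <- runif(100, 20000, 120000)
y <- 5 + 0.0001 * income_usd + rnorm(100)

# Convert the explanatory variable to pounds (1 pound = 1.33 dollars)
income_gbp <- income_usd / 1.33

fit_usd <- lm(y ~ income_usd)
fit_gbp <- lm(y ~ income_gbp)

slope_usd <- unname(coef(fit_usd)[2])
slope_gbp <- unname(coef(fit_gbp)[2])

# The slope per pound is 1.33 times the slope per dollar
print(c(slope_usd = slope_usd, slope_gbp = slope_gbp, ratio = slope_gbp / slope_usd))

# The correlation is the same in either currency
cor_usd <- cor(income_usd, y)
cor_gbp <- cor(income_gbp, y)
print(c(cor_usd = cor_usd, cor_gbp = cor_gbp))
```

Dividing x by a constant stretches the slope by that same constant, while the standardization built into the correlation cancels the change of units.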
Question 3
Water runoff in the Sierras (Data file: water in alr4) Can Southern California’s water
supply in future years be predicted from past data? One factor affecting water availability is stream
runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs
more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six
sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and
OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw
the scatterplot matrix for these data and summarize the information available from these
plots. (Hint: Use the pairs() function.)
Code
library(alr4)
Warning: package 'alr4' was built under R version 4.2.3
Loading required package: car
Warning: package 'car' was built under R version 4.2.3
Loading required package: carData
Warning: package 'carData' was built under R version 4.2.3
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
Loading required package: effects
Warning: package 'effects' was built under R version 4.2.3
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Attaching package: 'alr4'
The following object is masked _by_ '.GlobalEnv':
UN11
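Based on the scatterplot matrix, stream runoff near Bishop, California (BSAAM) appears to be positively correlated with precipitation at the three “O” sites (OPBPC, OPRC, and OPSLAKE), but shows no clear relationship with precipitation at the three “A” sites (APMAM, APSAB, and APSLAKE) or with year.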
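The visual impression from the scatterplot matrix can be backed up with pairwise correlations. The following is a sketch that assumes the alr4 package is installed and that every column of `water` is numeric; `runoff_cor` is a name introduced here for illustration.

```r
# Only run if alr4 is available (assumption: the package is installed)
if (requireNamespace("alr4", quietly = TRUE)) {
  water <- alr4::water

  # Correlation of every variable with stream runoff (BSAAM), rounded
  runoff_cor <- round(cor(water)[, "BSAAM"], 2)

  # Sort so the most strongly related sites appear first
  print(sort(runoff_cor, decreasing = TRUE))
}
```

Reading the sorted values alongside the plot makes it easier to say which sites track runoff closely and which do not.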
Question 4
Professor ratings (Data file: Rateprof in alr4) In the website and online forum
RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site
includes millions of ratings on thousands of instructors. The data file includes the summaries of
the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011).
Each instructor included in the data had at least 10 ratings over a several year period. Students
provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and
raterInterest in the subject matter covered in the instructor’s courses. The data file provides the
averages of these five ratings. Create a scatterplot matrix of these five variables. Provide a
brief description of the relationships between the five ratings.
Code
head(Rateprof)
gender numYears numRaters numCourses pepper discipline dept
1 male 7 11 5 no Hum English
2 male 6 11 5 no Hum Religious Studies
3 male 10 43 2 no Hum Art
4 male 11 24 5 no Hum English
5 male 11 19 7 no Hum Spanish
6 male 10 15 9 no Hum Spanish
quality helpfulness clarity easiness raterInterest sdQuality sdHelpfulness
1 4.636364 4.636364 4.636364 4.818182 3.545455 0.5518564 0.6741999
2 4.318182 4.545455 4.090909 4.363636 4.000000 0.9020179 0.9341987
3 4.790698 4.720930 4.860465 4.604651 3.432432 0.4529343 0.6663898
4 4.250000 4.458333 4.041667 2.791667 3.181818 0.9325048 0.9315329
5 4.684211 4.684211 4.684211 4.473684 4.214286 0.6500112 0.8200699
6 4.233333 4.266667 4.200000 4.533333 3.916667 0.8632717 1.0327956
sdClarity sdEasiness sdRaterInterest
1 0.5045250 0.4045199 1.1281521
2 0.9438798 0.5045250 1.0744356
3 0.4129681 0.5407021 1.2369438
4 0.9990938 0.5882300 1.3322506
5 0.5823927 0.6117753 0.9749613
6 0.7745967 0.6399405 0.6685579
Code
pairs(Rateprof[,8:12])
Quality: Based on the scatterplot matrix, there is a strong positive correlation between quality and helpfulness, as well as between quality and clarity (though one outlier shows comparatively high clarity paired with lower quality). There is a slightly positive correlation between quality and easiness, but no definitive correlation between quality and rater interest.
Helpfulness: As noted, helpfulness is positively correlated with quality, and also with clarity (though the spread of data points is slightly larger). As with quality, there is a slightly positive correlation between helpfulness and easiness, but no definitive correlation between helpfulness and rater interest.
Clarity: As noted, clarity is positively correlated with quality and with helpfulness (though the spread of data points is slightly larger for helpfulness). There is a slightly positive correlation between clarity and easiness, but no definitive correlation between clarity and rater interest.
Easiness: There are slightly positive correlations between easiness and quality, easiness and helpfulness, and easiness and clarity. There is no definitive correlation between easiness and rater interest.
Rater Interest: There are no definitive correlations between rater interest and any of the remaining four variables.
Overall, quality, helpfulness, and clarity are strongly related to one another, whereas easiness and rater interest are only weakly related to the other ratings.
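A correlation matrix gives a numeric counterpart to the descriptions above. This sketch assumes the alr4 package is installed; the column names come from the `head(Rateprof)` output, and `rating_cor` is a name introduced here for illustration.

```r
# Only run if alr4 is available (assumption: the package is installed)
if (requireNamespace("alr4", quietly = TRUE)) {
  rating_vars <- c("quality", "helpfulness", "clarity", "easiness", "raterInterest")
  ratings <- alr4::Rateprof[, rating_vars]

  # Pairwise correlations among the five average ratings
  rating_cor <- round(cor(ratings, use = "complete.obs"), 2)
  print(rating_cor)
}
```

Selecting the columns by name rather than by position (8:12) keeps the code correct even if the column order of the data frame changes.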
Question 5
For the student.survey data file in the smss package, conduct regression analyses relating
(by convention, y denotes the outcome variable, x denotes the explanatory variable)
(i) y = political ideology and x = religiosity,
(ii) y = high school GPA and x = hours of TV watching.
(You can use ?student.survey in the R console, after loading the package, to see what each variable
means.)
(a) Graphically portray how the explanatory variable relates to the outcome variable in
each of the two cases
(b) Summarize and interpret results of inferential analyses.
A)
Code
library(smss)
Warning: package 'smss' was built under R version 4.2.3
Code
data(student.survey)
head(student.survey)
subj ge ag hi co dh dr tv sp ne ah ve pa pi re
1 1 m 32 2.2 3.5 0 5.0 3 5 0 0 FALSE r conservative most weeks
2 2 f 23 2.1 3.5 1200 0.3 15 7 5 6 FALSE d liberal occasionally
3 3 f 27 3.3 3.0 1300 1.5 0 4 3 0 FALSE d liberal most weeks
4 4 f 35 3.5 3.2 1500 8.0 5 5 6 3 FALSE i moderate occasionally
5 5 m 23 3.1 3.5 1600 10.0 6 6 3 0 FALSE i very liberal never
6 6 m 39 3.5 3.5 350 3.0 4 5 7 0 FALSE d liberal occasionally
ab aa ld
1 FALSE FALSE FALSE
2 FALSE FALSE NA
3 FALSE FALSE NA
4 FALSE FALSE FALSE
5 FALSE FALSE FALSE
6 FALSE FALSE NA
Installing package into 'C:/Users/caitr/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)
Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
Code
library(stargazer)
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
Code
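# rename the variables of interest to descriptive names
survey <- student.survey %>%
  rename("poli_ideo" = "pi", "relig" = "re", "TV" = "tv", "gpa_hs" = "hi")
# treat the ordered factors as numeric scores for the regression
poli_relig <- lm(as.numeric(poli_ideo) ~ as.numeric(relig), data = survey)
gpa_tv <- lm(gpa_hs ~ TV, data = survey)
# side-by-side regression table
stargazer(poli_relig, gpa_tv, type = 'text',
          dep.var.labels = c('Political Ideology', 'High School GPA'),
          covariate.labels = c('Religiosity', 'Hours of TV'))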
================================================================
                                     Dependent variable:
                              ----------------------------------
                              Political Ideology High School GPA
                                     (1)               (2)
----------------------------------------------------------------
Religiosity                        0.970***
                                   (0.179)
Hours of TV                                          -0.018**
                                                     (0.009)
Constant                           0.931**           3.441***
                                   (0.425)           (0.085)
----------------------------------------------------------------
Observations                          60                60
R2                                  0.336             0.072
Adjusted R2                         0.324             0.056
Residual Std. Error (df = 58)       1.345             0.447
F Statistic (df = 1; 58)          29.336***          4.471**
================================================================
Note:                                *p<0.1; **p<0.05; ***p<0.01
Based on this inferential analysis, there is a statistically significant positive association between religiosity and political ideology (p-value<0.01). Specifically, with both ordered variables treated as numeric scores, each one-unit increase in religiosity is associated with a 0.970-unit increase in political ideology. However, the adjusted R-squared value is 0.324, so religiosity explains only about a third of the variation in political ideology.
There is also a statistically significant negative association between hours of TV watched and high school GPA (p-value<0.05). Specifically, each additional hour of TV watched is associated with a 0.018-point decline in GPA. However, the adjusted R-squared value is 0.056, so hours of TV explain very little of the variation in GPA and the model has little predictive value.
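A confidence interval for the slope complements the significance stars in the table. The following is a sketch, assuming the smss package is installed and using the original column names `hi` (high school GPA) and `tv` (hours of TV) from the `head(student.survey)` output; `gpa_tv_ci` is a name introduced here for illustration.

```r
# Only run if smss is available (assumption: the package is installed)
if (requireNamespace("smss", quietly = TRUE)) {
  data("student.survey", package = "smss")

  # Same model as in the table: high school GPA regressed on hours of TV
  gpa_tv_fit <- lm(hi ~ tv, data = student.survey)

  # 95% confidence intervals for the intercept and the slope
  gpa_tv_ci <- confint(gpa_tv_fit, level = 0.95)
  print(gpa_tv_ci)

  # R-squared: the share of variation in GPA explained by hours of TV
  print(summary(gpa_tv_fit)$r.squared)
}
```

An interval that excludes zero but sits close to it, together with a small R-squared, is consistent with an association that is statistically detectable yet weak.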