Homework 3

hw3

scatterplot

linear_regression

emma_narkewicz

linear regression

Author

Emma Narkewicz

Published

April 11, 2023

Question 1

United Nations (Data file: UN11in alr4) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.

a)

Identify the predictor and the response.

If we are studying the dependence of fertility on ppgdp (gross national product per person) then:

Predictor = ppgdp

Response = fertility

Code

#load necessary packages
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.0      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

library(alr4)

Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.

Code

library(smss)

Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?

Code

#linear graph

library(alr4)

ggplot(data = UN11, aes(x=ppgdp, y=fertility)) +   geom_point() +  geom_smooth(method = 'lm', se=F) + labs(title = "Scatterplot of Fertility vs. GDP per person (ppgdp)" )

`geom_smooth()` using formula 'y ~ x'

The above scatter plot of fertility vs. GDP per person (ppgdp) shows that the relationship between the two variables is not linear, but instead a curvilinear relationship. Therefore a straight-line mean function does not seems plausible. Fertility steeply drops as the ppdgp first increases but then levels out. Most of the data is concentrated around the left side of the x-axis with lower ppgdp

c

Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph?

Code

#logarithmic graph
ggplot(data = UN11, aes(x=log(ppgdp), y=log(fertility))) +   geom_point() +  geom_smooth(method = 'lm', se=F) + labs(title = "Scatterplot of Log of Fertility vs. Log of GDP per person (ppgdp)")

`geom_smooth()` using formula 'y ~ x'

A linear regression model of the natural logs of fertility vs. per person GDP (ppgdp) is a much better model of the relationship between the two variables than the prior model. As the log(ppgdp) increases, the log(fertility) decreases consistently. The data points are distributed consistently below and above the trendline.

Code

## mathematical correlation test

cor_lin <- cor.test(UN11$fertility, UN11$ppgdp)
cor_log <- cor.test(log(UN11$fertility), log(UN11$ppgdp))
cor_lin


    Pearson's product-moment correlation

data:  UN11$fertility and UN11$ppgdp
t = -6.877, df = 197, p-value = 7.903e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.5456842 -0.3205140
sample estimates:
       cor 
-0.4399891

Code

cor_log


    Pearson's product-moment correlation

data:  log(UN11$fertility) and log(UN11$ppgdp)
t = -14.785, df = 197, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7851376 -0.6519206
sample estimates:
       cor 
-0.7252483

Not only can that be seen visually in the scatter-plot above, but from the measure of Pearson’s correlation. The Pearson’s r correlation coefficient for for the linear regression model taking the natural logs of both variables (-.724) is a stronger correlation than for the linear regression model of the original data is (-0.440).

Question 2

Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).

##a

How, if at all, does the slope of the prediction equation change?

slope = (y2-y1)/(x2-x1)

Code

#slope hypothetical annual income dollars & pounds 
annualincome_dollars = c(1, 2)
annualincome_pounds = 1.33 * (annualincome_dollars)
y = c(1,2)

slope_dollars =  (2-1)/(2-1) 
slope_dollars

[1] 1

Code

slope_pounds = (2-1)/(2.66 - 1.33)
slope_pounds

[1] 0.7518797

Converting income in dollars to income in pounds in an analysis where income is the explanatory variable (y) would result in a less steep slope after the conversion.

b

How, if at all, does the correlation change?

The correlation would not change after converting the units of income from dollars to British pounds. We learned in class that correlation is a standardized version of the slope, with its value not depending on units.

Question 3

Water runoff in the Sierras (Data file: water in alr4) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots. (Hint: Use the pairs() function.)

Code

#create scatterplot matrix precipitation 
head(water)

  Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594

Code

pairs(water)

As shown in the scatter plot matrix above, the precipitation at APMAM, APSAB, & APSLAKE do not appear to be strongly correlated with run off in BSAAM, whereas the precipitation measured at OPBPC, OPRC, & OPSLAKE do appear to be more strongly & positively correlated with the run-off measured at BSAAM. The top left year variable does not appear to be correlated with any precipitation or runoff levels.

Looking at the correlation between lake precipitation, the precipitation of the 3 lakes beginning with “AP” appear to be strongly positively correlated with one another. The precipitation level of the 3 lakes beginning with “OP” are also strongly positively correlated with one another.

From the scatter plot matrix above, the precipitation at the OPBC, OPRC, and OPSLAKE seem to be the best predictors (explanatory variables) for the runoff level at BSAAM.

To test this, I created a linear regression model with all the lakes as the explanatory variable for runoff at BSAAM. The model with all lake participation as explanatory variables had a strong adjusted R-squared value of 0.9123, with the explanatory variables of OPRC & OPSLAKE both being statistically significant at the 0.05 level.

Code

#linear regression all lake precipitation
water1 <-lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water)
summary(water1)


Call:
lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + 
    OPSLAKE, data = water)

Residuals:
   Min     1Q Median     3Q    Max 
-12690  -4936  -1424   4173  18542 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15944.67    4099.80   3.889 0.000416 ***
APMAM         -12.77     708.89  -0.018 0.985725    
APSAB        -664.41    1522.89  -0.436 0.665237    
APSLAKE      2270.68    1341.29   1.693 0.099112 .  
OPBPC          69.70     461.69   0.151 0.880839    
OPRC         1916.45     641.36   2.988 0.005031 ** 
OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7557 on 36 degrees of freedom
Multiple R-squared:  0.9248,    Adjusted R-squared:  0.9123 
F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16

Running the linear regression with just the OP lakes as explanatory variables gave an adjusted R squared of 0.8941, suggesting the majority of the variation in BSAAM run off is indeed explained by the precipation at the OP lakes.

Code

#linear model of just OP lakes

water2 <-lm(BSAAM ~  OPBPC + OPRC + OPSLAKE, data = water)
summary(water2)


Call:
lm(formula = BSAAM ~ OPBPC + OPRC + OPSLAKE, data = water)

Residuals:
     Min       1Q   Median       3Q      Max 
-15964.1  -6491.8   -404.4   4741.9  19921.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 22991.85    3545.32   6.485  1.1e-07 ***
OPBPC          40.61     502.40   0.081  0.93599    
OPRC         1867.46     647.04   2.886  0.00633 ** 
OPSLAKE      2353.96     771.71   3.050  0.00410 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8304 on 39 degrees of freedom
Multiple R-squared:  0.9017,    Adjusted R-squared:  0.8941 
F-statistic: 119.2 on 3 and 39 DF,  p-value: < 2.2e-16

Question 4

Professor ratings (Data file: Rateprof in alr4) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Create a scatterplot matrix of these five variables. Provide a brief description of the relationships between the five ratings.

Code

#RateMyProfessor
head(Rateprof)

  gender numYears numRaters numCourses pepper discipline              dept
1   male        7        11          5     no        Hum           English
2   male        6        11          5     no        Hum Religious Studies
3   male       10        43          2     no        Hum               Art
4   male       11        24          5     no        Hum           English
5   male       11        19          7     no        Hum           Spanish
6   male       10        15          9     no        Hum           Spanish
   quality helpfulness  clarity easiness raterInterest sdQuality sdHelpfulness
1 4.636364    4.636364 4.636364 4.818182      3.545455 0.5518564     0.6741999
2 4.318182    4.545455 4.090909 4.363636      4.000000 0.9020179     0.9341987
3 4.790698    4.720930 4.860465 4.604651      3.432432 0.4529343     0.6663898
4 4.250000    4.458333 4.041667 2.791667      3.181818 0.9325048     0.9315329
5 4.684211    4.684211 4.684211 4.473684      4.214286 0.6500112     0.8200699
6 4.233333    4.266667 4.200000 4.533333      3.916667 0.8632717     1.0327956
  sdClarity sdEasiness sdRaterInterest
1 0.5045250  0.4045199       1.1281521
2 0.9438798  0.5045250       1.0744356
3 0.4129681  0.5407021       1.2369438
4 0.9990938  0.5882300       1.3322506
5 0.5823927  0.6117753       0.9749613
6 0.7745967  0.6399405       0.6685579

Code

Rateprof2 <- Rateprof %>%
  select(c(quality, clarity, helpfulness, easiness, raterInterest))  

pairs(Rateprof2)

From the scatter plot matrix, it is shown there seems to be a strong positive correlation between an instructor’s quality and clarity and quality and helpfulness. The correlation between an instructor’s clarity and helpfulness also appears relatively strong & positive. Easiness and raterInterest have a much weaker correlation with any other variables.

Question 5

For the student.survey data file in the smss package, conduct regression analyses relating (by convention, y denotes the outcome variable, x denotes the explanatory variable)

a

Graphically portray how the explanatory variable relates to the outcome variable in each of the two cases

Code

#load & preview data
library(smss)
data("student.survey")
head(student.survey)

  subj ge ag  hi  co   dh   dr tv sp ne ah    ve pa           pi           re
1    1  m 32 2.2 3.5    0  5.0  3  5  0  0 FALSE  r conservative   most weeks
2    2  f 23 2.1 3.5 1200  0.3 15  7  5  6 FALSE  d      liberal occasionally
3    3  f 27 3.3 3.0 1300  1.5  0  4  3  0 FALSE  d      liberal   most weeks
4    4  f 35 3.5 3.2 1500  8.0  5  5  6  3 FALSE  i     moderate occasionally
5    5  m 23 3.1 3.5 1600 10.0  6  6  3  0 FALSE  i very liberal        never
6    6  m 39 3.5 3.5  350  3.0  4  5  7  0 FALSE  d      liberal occasionally
     ab    aa    ld
1 FALSE FALSE FALSE
2 FALSE FALSE    NA
3 FALSE FALSE    NA
4 FALSE FALSE FALSE
5 FALSE FALSE FALSE
6 FALSE FALSE    NA

Code

summary(student.survey)

      subj       ge           ag              hi              co       
 Min.   : 1.00   f:31   Min.   :22.00   Min.   :2.000   Min.   :2.600  
 1st Qu.:15.75   m:29   1st Qu.:24.00   1st Qu.:3.000   1st Qu.:3.175  
 Median :30.50          Median :26.50   Median :3.350   Median :3.500  
 Mean   :30.50          Mean   :29.17   Mean   :3.308   Mean   :3.453  
 3rd Qu.:45.25          3rd Qu.:31.00   3rd Qu.:3.625   3rd Qu.:3.725  
 Max.   :60.00          Max.   :71.00   Max.   :4.000   Max.   :4.000  
                                                                       
       dh             dr               tv               sp        
 Min.   :   0   Min.   : 0.200   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 205   1st Qu.: 1.450   1st Qu.: 3.000   1st Qu.: 3.000  
 Median : 640   Median : 2.000   Median : 6.000   Median : 5.000  
 Mean   :1232   Mean   : 3.818   Mean   : 7.267   Mean   : 5.483  
 3rd Qu.:1350   3rd Qu.: 5.000   3rd Qu.:10.000   3rd Qu.: 7.000  
 Max.   :8000   Max.   :20.000   Max.   :37.000   Max.   :16.000  
                                                                  
       ne               ah             ve          pa    
 Min.   : 0.000   Min.   : 0.000   Mode :logical   d:21  
 1st Qu.: 2.000   1st Qu.: 0.000   FALSE:60        i:24  
 Median : 3.000   Median : 0.500                   r:15  
 Mean   : 4.083   Mean   : 1.433                         
 3rd Qu.: 5.250   3rd Qu.: 2.000                         
 Max.   :14.000   Max.   :11.000                         
                                                         
                     pi                re         ab              aa         
 very liberal         : 8   never       :15   Mode :logical   Mode :logical  
 liberal              :24   occasionally:29   FALSE:60        FALSE:59       
 slightly liberal     : 6   most weeks  : 7                   NA's :1        
 moderate             :10   every week  : 9                                  
 slightly conservative: 6                                                    
 conservative         : 4                                                    
 very conservative    : 2                                                    
     ld         
 Mode :logical  
 FALSE:44       
 NA's :16

Code

##?student.survey

y = political ideology and x = religiosity,

Political ideology and religiosity are currently coded as ordinal variables.

There are 7 levels of political idealogy: - very liberal
- liberal - slightly liberal - moderate - slightly conservative - conservative - very conservative

There are 4 levels of religiosity: - never - occasionally - most weeks - every week

Code

#Scatterplot Religiousity vs. Political Ideology
ggplot(data = student.survey, aes(x = re , y = pi)) +   geom_smooth(method = 'lm', se=F) +  geom_point() + labs(title = "Scatterplot: Religiousity vs. Political Ideology", x = "Religiousity", y = "Political Ideology")

`geom_smooth()` using formula 'y ~ x'

From the scatter-plot there appears to be a fairly strong positive correlation between religiosity and political ideology, with the more often someone attends religious services the more conservative their political ideology. A scatter plot model suggests that both ordinal scales have a fixed distance between responses, which is not correct for either variable.

To supplement the scatter plot I also created a bar graph to

Code

#Bargraph pi vs re

ggplot(data = student.survey, aes(x = re, fill = pi)) + geom_bar(position = "fill") + labs(title = "Bar Graph: Religiousity vs. Political Ideology", x = "Religiousity", y = "Political Ideology")

The bar graph shows a high concentration of people who never or occasionally attend religious services were mostly liberal (indigo-purple), whereas people who report attending religious services most weeks were more conservative (green). People who reported attending religious services every week had mostly conservative (green) and very conservative (yellow) political idealogies.

y = high school GPA and x = hours of TV watching.

Code

#scatterplot hi vs. tv
ggplot(data = df, aes(x = tv , y = hi)) +   geom_point() +  geom_smooth(method = 'lm', se=F) + labs(title = "Scatterplot Highschool GPA vs. Hours of tv watched", x = "TV(avg hours per week)", y = "High School GPA")

Error in `ggplot()`:
!   You're passing a function as global data.
  Have you misspelled the `data` argument in `ggplot()`

A scatter plot works well for modeling two numeric variables, high school GPA & avg hours of tv watched per week. Looking at the data fitted with a trend line association between most data points are above of below the trend line and in the upper left corner of the scatter plot, suggesting most students watch been 0-10 hours of TV have a GPA of 3.0 or above.

It does not look like from the scatter plot there is an association between these variables. The trend line suggests if there is a correlation between the variables, it is negative in nature, with more hours of TV correlating with lower high school GPA. This will be investigated further in part b.

b Summarize and interpret results of inferential analyses.

y = political ideology and x = religiosity,

In order to create a linear regression model, both ordinal variables needed to be converted to numeric type. The resulting model suggests

Code

#Linear regression of pi & re
lm_pi <-lm(as.numeric(pi) ~ as.numeric(re) , data = student.survey)
summary(lm_pi)


Call:
lm(formula = as.numeric(pi) ~ as.numeric(re), data = student.survey)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81243 -0.87160  0.09882  1.12840  3.09882 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.9308     0.4252   2.189   0.0327 *  
as.numeric(re)   0.9704     0.1792   5.416 1.22e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared:  0.3359,    Adjusted R-squared:  0.3244 
F-statistic: 29.34 on 1 and 58 DF,  p-value: 1.221e-06

The linear regression model has a positive coefficient of 0.97, suggesting for “one unit” increase in religiosity there is nearly a “one unit” increase in political ideology. One unit is in quotes because these variables are ordinal and there isn’t actually fixed distance in units between them. Nonetheless, there is a positive, statistically significant correlation (p value < 0.001). The adjusted R-squared value is 0.3244 which is weak-moderate suggests there is a correlation between the two variables

y = high school GPA and x = hours of TV watching. ,

Code

#linear regression model gpa vs tv
lm_gpa <-lm(hi ~ tv , data = student.survey) 
summary(lm_gpa)


Call:
lm(formula = hi ~ tv, data = student.survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.441353   0.085345  40.323   <2e-16 ***
tv          -0.018305   0.008658  -2.114   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879

The linear regression shows a negative correlation between hours of tv watched and high school GPA that is statistically significant at the 0.05 level (p = 0.0388), with ever hour of tv watched a week resulting in a 0.018 reduction in high school GPA. Looking at the adjusted R-squared of 0.05555 is extremely low, suggesting that there is not strong association between the 2 variables. A linear model does not appear to be a good fit for this data.