HW3
Author

Diana Rinker

Published

April 11, 2023

DACSS 603, spring 2023

Homework 3, Diana Rinker.

Loading necessary libraries:

Code
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(ggplot2)
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)

Question 1

United Nations (Data file: UN11 in alr4) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.

  1. Identify the predictor and the response.

    DV/response: fertility

    IV/predictor: ppgdp

  2. Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?

Code
data(UN11)
str(UN11)
'data.frame':   199 obs. of  6 variables:
 $ region   : Factor w/ 8 levels "Africa","Asia",..: 2 4 1 1 3 5 2 3 8 4 ...
 $ group    : Factor w/ 3 levels "oecd","other",..: 2 2 3 3 2 2 2 2 1 1 ...
 $ fertility: num  5.97 1.52 2.14 5.13 2 ...
 $ ppgdp    : num  499 3677 4473 4322 13750 ...
 $ lifeExpF : num  49.5 80.4 75 53.2 81.1 ...
 $ pctUrban : num  23 53 67 59 100 93 64 47 89 68 ...
 - attr(*, "na.action")= 'omit' Named int [1:34] 4 5 8 28 41 67 68 72 79 83 ...
  ..- attr(*, "names")= chr [1:34] "Am Samoa" "Andorra" "Antigua and Barbuda" "Br Virigin Is" ...
Code
    ggplot(data=UN11, mapping= aes(x=ppgdp, y =fertility ))+
      geom_point()+
      stat_smooth(method = "lm", se = T)+
      geom_hline(yintercept = mean(UN11$fertility), color = "red")+
      labs(title="Scatterplot of fertility by ppgdp")
`geom_smooth()` using formula = 'y ~ x'

From the graph above it is clear that a straight-line mean function is not a good summary of these data. Even when we add a fitted line with stat_smooth(method = "lm", se = T), the line lies far from most data points, which indicates large errors in the model.
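
As a quick numeric check (a minimal sketch, not required by the question; fit_raw is a placeholder name), we could fit the untransformed model and see how little of the variation in fertility it explains:

Code
# Sketch: straight-line fit of fertility on untransformed ppgdp
fit_raw <- lm(fertility ~ ppgdp, data = UN11)
summary(fit_raw)$r.squared  # proportion of variance explained; expected to be low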

  3. Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.
Code
   ggplot(data=UN11, mapping= aes(x=log(ppgdp), y =log(fertility) ))+
      geom_point()+
      stat_smooth(method = "lm", se = T)+
      geom_hline(yintercept = mean(log(UN11$fertility)), color = "red")+
      labs(title="Scatterplot of log(fertility) by log(ppgdp) ")
`geom_smooth()` using formula = 'y ~ x'

Simple linear regression seems much more plausible on this logged scatter plot.
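
To back this up, we could also fit the log-log model directly (a minimal sketch; fit_loglog is a placeholder name):

Code
# Simple linear regression on the natural-log scale
fit_loglog <- lm(log(fertility) ~ log(ppgdp), data = UN11)
summary(fit_loglog)  # slope estimates the elasticity of fertility with respect to ppgdp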

Below, I use different logarithm bases (5 and 7). The shape of the point cloud and the fitted line do not change; only the axis scales change:

Code
ggplot(data=UN11, mapping= aes(x=log(ppgdp, base =5), y =log(fertility, base =5) ))+
      geom_point()+
      stat_smooth(method = "lm", se = T)+
      geom_hline(yintercept = mean(log(UN11$fertility, base =5)), color = "red")+
      labs(title="Scatterplot of log(fertility) by log(ppgdp), base =5 ")
`geom_smooth()` using formula = 'y ~ x'

Code
ggplot(data=UN11, mapping= aes(x=log(ppgdp, base =7), y =log(fertility, base =7) ))+
      geom_point()+
      stat_smooth(method = "lm", se = T)+
      geom_hline(yintercept = mean(log(UN11$fertility, base =7)), color = "red")+
      labs(title="Scatterplot of log(fertility) by log(ppgdp), base =7 ")
`geom_smooth()` using formula = 'y ~ x'
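
This happens because of the change-of-base identity log_b(x) = log(x) / log(b): switching bases only multiplies the logged values by a constant, so the point cloud is merely rescaled. A quick sketch verifying this numerically:

Code
# Base-5 logs are natural logs divided by a constant, so only the axis scale changes
all.equal(log(UN11$ppgdp, base = 5), log(UN11$ppgdp) / log(5))  # TRUE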

Question 2

Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).

(a) How, if at all, does the slope of the prediction equation change?

(b) How, if at all, does the correlation change?

To answer these questions, I will rescale ppgdp by the conversion factor, run linear regressions with the predictor expressed in dollars and in pounds, and compare the results.

Code
UN11$ppgdp.pound<-UN11$ppgdp*1.33
fit1<- lm(lifeExpF ~ ppgdp, data = UN11)
summary(fit1)

Call:
lm(formula = lifeExpF ~ ppgdp, data = UN11)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.825  -4.889   2.618   6.619  11.299 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.837e+01  7.370e-01  92.762   <2e-16 ***
ppgdp       3.018e-04  3.274e-05   9.218   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.483 on 197 degrees of freedom
Multiple R-squared:  0.3014,    Adjusted R-squared:  0.2978 
F-statistic: 84.98 on 1 and 197 DF,  p-value: < 2.2e-16
Code
fit2<- lm(lifeExpF ~ ppgdp.pound, data = UN11)
summary(fit2)

Call:
lm(formula = lifeExpF ~ ppgdp.pound, data = UN11)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.825  -4.889   2.618   6.619  11.299 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.837e+01  7.370e-01  92.762   <2e-16 ***
ppgdp.pound 2.269e-04  2.462e-05   9.218   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.483 on 197 degrees of freedom
Multiple R-squared:  0.3014,    Adjusted R-squared:  0.2978 
F-statistic: 84.98 on 1 and 197 DF,  p-value: < 2.2e-16

Visualizing the model with both versions of the predictor (ppgdp and ppgdp.pound) produces identical plots; only the x-axis scale differs:

Code
ggplot( data=UN11, mapping=aes(x= ppgdp , y=lifeExpF))+
  geom_point() +
  geom_smooth(method = 'lm')+
      labs(title="Scatterplot of Female life expectancy by annual income ", x="PPGDP in US dollars", y="Life expectancy")
`geom_smooth()` using formula = 'y ~ x'

Code
ggplot( data=UN11, mapping=aes(x= ppgdp.pound , y=lifeExpF))+
  geom_point() +
  geom_smooth(method = 'lm')+
      labs(title="Scatterplot of Female life expectancy by annual income ", x="PPGDP in British pounds", y="Life expectancy")
`geom_smooth()` using formula = 'y ~ x'

Visualizing the logged predictor (in US dollars and in British pounds):

Code
ggplot( data=UN11, mapping=aes(x= log(ppgdp) , y=lifeExpF   )     )+
  geom_point() +
  geom_smooth(method = 'lm')+
      labs(title="Scatterplot of Female life expectancy by annual income ", x="Logged PPGDP in US dollars", y="Life expectancy")
`geom_smooth()` using formula = 'y ~ x'

Code
ggplot( data=UN11, mapping=aes(x= log(ppgdp.pound) , y=lifeExpF    )     )+
  geom_point() +
  geom_smooth(method = 'lm')+
      labs(title="Scatterplot of Female life expectancy by annual income ", x="Logged PPGDP in British pounds", y="Life expectancy")
`geom_smooth()` using formula = 'y ~ x'

The slope does change: rescaling the predictor by a constant rescales the slope by the reciprocal of that constant. Here the pound slope (2.269e-04) is the dollar slope (3.018e-04) divided by 1.33; converting dollars to pounds by dividing by 1.33 instead would multiply the slope by 1.33. The correlation, however, does not change: R², the t-statistic, and the fitted values are identical in the two summaries, because correlation is unaffected by a linear rescaling of either variable.
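
A direct check of this, reusing fit1 and fit2 from above (a small sketch):

Code
# The two slopes differ exactly by the factor used to rescale the predictor
coef(fit1)["ppgdp"] / coef(fit2)["ppgdp.pound"]  # about 1.33

# The correlation with the response is unchanged by a linear rescaling of x
cor(UN11$ppgdp, UN11$lifeExpF)
cor(UN11$ppgdp.pound, UN11$lifeExpF)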

Code
fit_logged<- lm(lifeExpF~ log(ppgdp), data = UN11)

summary(fit_logged)

Call:
lm(formula = lifeExpF ~ log(ppgdp), data = UN11)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.749  -2.879   1.280   3.987  12.345 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  29.8148     2.5314   11.78   <2e-16 ***
log(ppgdp)    5.0188     0.2942   17.06   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.448 on 197 degrees of freedom
Multiple R-squared:  0.5964,    Adjusted R-squared:  0.5943 
F-statistic: 291.1 on 1 and 197 DF,  p-value: < 2.2e-16

However, using log() on the explanatory variable gives a much better-fitting model, with R² increasing from 0.30 to 0.60. The slope of about 5.02 means that a 1% increase in PPGDP corresponds to roughly a 0.05-year increase in female life expectancy (equivalently, a one-unit increase in log(ppgdp) corresponds to about a 5-year increase).
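
The "per 1% increase" interpretation comes directly from the fitted coefficient: in a level-log model (response in levels, predictor logged), a 1% increase in x changes the prediction by β · log(1.01) ≈ β / 100. A quick sketch using the fitted model:

Code
# Predicted change in female life expectancy for a 1% increase in ppgdp
coef(fit_logged)["log(ppgdp)"] * log(1.01)  # roughly 5.02 * 0.00995 ≈ 0.05 years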

Question 3

Water runoff in the Sierras (Data file: water in alr4) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots. (Hint: Use the pairs() function.)

Code
library(alr4)
data(water)
str(water)
'data.frame':   43 obs. of  8 variables:
 $ Year   : int  1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 ...
 $ APMAM  : num  9.13 5.28 4.2 4.6 7.15 9.7 5.02 6.7 10.5 9.1 ...
 $ APSAB  : num  3.58 4.82 3.77 4.46 4.99 5.65 1.45 7.44 5.85 6.13 ...
 $ APSLAKE: num  3.91 5.2 3.67 3.93 4.88 4.91 1.77 6.51 3.38 4.08 ...
 $ OPBPC  : num  4.1 7.55 9.52 11.14 16.34 ...
 $ OPRC   : num  7.43 11.11 12.2 15.15 20.05 ...
 $ OPSLAKE: num  6.47 10.26 11.35 11.13 22.81 ...
 $ BSAAM  : int  54235 67567 66161 68094 107080 67594 65356 67909 92715 70024 ...
Code
head(water)
  Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594
Code
water$Year <-as.factor(water$Year)
levels(water$Year)
 [1] "1948" "1949" "1950" "1951" "1952" "1953" "1954" "1955" "1956" "1957"
[11] "1958" "1959" "1960" "1961" "1962" "1963" "1964" "1965" "1966" "1967"
[21] "1968" "1969" "1970" "1971" "1972" "1973" "1974" "1975" "1976" "1977"
[31] "1978" "1979" "1980" "1981" "1982" "1983" "1984" "1985" "1986" "1987"
[41] "1988" "1989" "1990"
Code
pairs(subset (water, select = c(APMAM : BSAAM)))

From the scatterplot matrix above I can see that some of the water sources are well correlated with each other, as their scatterplots lie close to a straight line. Moreover, there are two groups of sites that appear interconnected: the first group is APMAM, APSAB, and APSLAKE; the second is OPBPC, OPRC, OPSLAKE, and BSAAM. It is possible that the sites within a group share a water source, or that the two groups are separated from each other (perhaps by mountains).
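
To quantify these groupings, we could also look at the correlation matrix of the same columns (a minimal sketch):

Code
# Pairwise correlations between the six precipitation sites and runoff (BSAAM)
round(cor(subset(water, select = APMAM:BSAAM)), 2)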

Code
pairs(subset (water, select = c(APMAM : APSLAKE)))

Code
pairs(subset (water, select = c(OPBPC : BSAAM)))

Question 4

Professor ratings (Data file: Rateprof in alr4) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and rater Interest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Create a scatter plot matrix of these five variables. Provide a brief description of the relationships between the five ratings.

Code
data(Rateprof)
str(Rateprof)
'data.frame':   366 obs. of  17 variables:
 $ gender         : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
 $ numYears       : int  7 6 10 11 11 10 7 11 11 7 ...
 $ numRaters      : int  11 11 43 24 19 15 17 16 12 18 ...
 $ numCourses     : int  5 5 2 5 7 9 3 3 4 4 ...
 $ pepper         : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ discipline     : Factor w/ 4 levels "Hum","SocSci",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ dept           : Factor w/ 48 levels "Accounting","Anthropology",..: 17 42 3 17 45 45 45 17 34 17 ...
 $ quality        : num  4.64 4.32 4.79 4.25 4.68 ...
 $ helpfulness    : num  4.64 4.55 4.72 4.46 4.68 ...
 $ clarity        : num  4.64 4.09 4.86 4.04 4.68 ...
 $ easiness       : num  4.82 4.36 4.6 2.79 4.47 ...
 $ raterInterest  : num  3.55 4 3.43 3.18 4.21 ...
 $ sdQuality      : num  0.552 0.902 0.453 0.933 0.65 ...
 $ sdHelpfulness  : num  0.674 0.934 0.666 0.932 0.82 ...
 $ sdClarity      : num  0.505 0.944 0.413 0.999 0.582 ...
 $ sdEasiness     : num  0.405 0.505 0.541 0.588 0.612 ...
 $ sdRaterInterest: num  1.128 1.074 1.237 1.332 0.975 ...
Code
head(Rateprof)
  gender numYears numRaters numCourses pepper discipline              dept
1   male        7        11          5     no        Hum           English
2   male        6        11          5     no        Hum Religious Studies
3   male       10        43          2     no        Hum               Art
4   male       11        24          5     no        Hum           English
5   male       11        19          7     no        Hum           Spanish
6   male       10        15          9     no        Hum           Spanish
   quality helpfulness  clarity easiness raterInterest sdQuality sdHelpfulness
1 4.636364    4.636364 4.636364 4.818182      3.545455 0.5518564     0.6741999
2 4.318182    4.545455 4.090909 4.363636      4.000000 0.9020179     0.9341987
3 4.790698    4.720930 4.860465 4.604651      3.432432 0.4529343     0.6663898
4 4.250000    4.458333 4.041667 2.791667      3.181818 0.9325048     0.9315329
5 4.684211    4.684211 4.684211 4.473684      4.214286 0.6500112     0.8200699
6 4.233333    4.266667 4.200000 4.533333      3.916667 0.8632717     1.0327956
  sdClarity sdEasiness sdRaterInterest
1 0.5045250  0.4045199       1.1281521
2 0.9438798  0.5045250       1.0744356
3 0.4129681  0.5407021       1.2369438
4 0.9990938  0.5882300       1.3322506
5 0.5823927  0.6117753       0.9749613
6 0.7745967  0.6399405       0.6685579
Code
pairs(subset (Rateprof, select =c(quality : raterInterest)))

As we can see from the scatterplot matrix above, three variables are clearly highly correlated: quality, helpfulness, and clarity. In particular, quality is highly correlated with both helpfulness and clarity. I suspect these variables could create a multicollinearity problem if used together in a multiple regression model.

The relationship between clarity and helpfulness is also distinct, though with somewhat more variability.

The easiness variable appears positively correlated with the first three (quality, helpfulness, and clarity), but with much higher variability.

Rater interest appears only weakly related to the other ratings and would likely show a shallower regression slope.
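
A correlation matrix over the five rating averages makes these impressions concrete (a minimal sketch):

Code
# Pairwise correlations among the five average ratings
round(cor(subset(Rateprof, select = quality:raterInterest), use = "pairwise.complete.obs"), 2)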


Question 5

For the student.survey data file in the smss package, conduct regression analyses relating (by convention, y denotes the outcome variable, x denotes the explanatory variable)

(i) y = political ideology and x = religiosity,

(ii) y = high school GPA and x = hours of TV watching. (You can use ?student.survey in the R console, after loading the package, to see what each variable means.)

(a) Graphically portray how the explanatory variable relates to the outcome variable in each of the two cases

(b) Summarize and interpret results of inferential analyses.

i. y = political ideology and x = religiosity

Using ?student.survey, I found that political ideology is stored in the pi variable. Religiosity does not have its own variable, but it can be represented by re, "how often you attend religious services," from this data set.

Therefore, my model can be written as

\[ PI = \beta_0 + \beta_1 \cdot RE \]

Code
data(student.survey)
?student.survey
starting httpd help server ... done
Code
str(student.survey)
'data.frame':   60 obs. of  18 variables:
 $ subj: int  1 2 3 4 5 6 7 8 9 10 ...
 $ ge  : Factor w/ 2 levels "f","m": 2 1 1 1 2 2 2 1 2 2 ...
 $ ag  : int  32 23 27 35 23 39 24 31 34 28 ...
 $ hi  : num  2.2 2.1 3.3 3.5 3.1 3.5 3.6 3 3 4 ...
 $ co  : num  3.5 3.5 3 3.2 3.5 3.5 3.7 3 3 3.1 ...
 $ dh  : int  0 1200 1300 1500 1600 350 0 5000 5000 900 ...
 $ dr  : num  5 0.3 1.5 8 10 3 0.2 1.5 2 2 ...
 $ tv  : num  3 15 0 5 6 4 5 5 7 1 ...
 $ sp  : int  5 7 4 5 6 5 12 3 5 1 ...
 $ ne  : int  0 5 3 6 3 7 4 3 3 2 ...
 $ ah  : int  0 6 0 3 0 0 2 1 0 1 ...
 $ ve  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ pa  : Factor w/ 3 levels "d","i","r": 3 1 1 2 2 1 2 2 2 2 ...
 $ pi  : Ord.factor w/ 7 levels "very liberal"<..: 6 2 2 4 1 2 2 2 1 3 ...
 $ re  : Ord.factor w/ 4 levels "never"<"occasionally"<..: 3 2 3 2 1 2 2 2 2 1 ...
 $ ab  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ aa  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ ld  : logi  FALSE NA NA FALSE FALSE NA ...
Code
# head(student.survey)
class(student.survey$pi)
[1] "ordered" "factor" 
Code
levels(student.survey$pi)
[1] "very liberal"          "liberal"               "slightly liberal"     
[4] "moderate"              "slightly conservative" "conservative"         
[7] "very conservative"    
Code
class(student.survey$re)
[1] "ordered" "factor" 

We can see that both variables are ordered factors. To fit a linear regression, we convert them to numeric scores and then use them in the model:

Code
student.survey$pi <-as.numeric (student.survey$pi)
student.survey$re <-as.numeric (student.survey$re)
summary(lm(pi~re, data=student.survey))

Call:
lm(formula = pi ~ re, data = student.survey)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81243 -0.87160  0.09882  1.12840  3.09882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.9308     0.4252   2.189   0.0327 *  
re            0.9704     0.1792   5.416 1.22e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared:  0.3359,    Adjusted R-squared:  0.3244 
F-statistic: 29.34 on 1 and 58 DF,  p-value: 1.221e-06
Code
ggplot(data = student.survey, aes(x = re, y = pi)) +
  geom_point() +
  geom_smooth(method = 'lm', se=F)+
  labs(title="Political ideology and religiosity", x="Frequency of going to church", y = "level of consrvatism")
`geom_smooth()` using formula = 'y ~ x'

Based on the summary above, religiosity and political ideology are positively related: each one-step increase in attendance frequency is associated with about a 0.97-step shift toward the conservative end of the scale. Religiosity explains roughly a third of the variance in political ideology (R² = 0.34), so higher conservatism can be only partly predicted by frequency of attending religious services.
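
Since this is a simple regression, R² is just the squared correlation between the two variables, which we could report directly (sketch):

Code
# Correlation between attendance frequency and conservatism;
# its square should match the model's R-squared of about 0.34
cor(student.survey$re, student.survey$pi)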

ii. y = high school GPA and x = hours of TV watching

High school GPA is stored in the hi variable, and hours of TV watching in the tv variable.

Code
class(student.survey$hi)
[1] "numeric"
Code
class(student.survey$tv)
[1] "numeric"

Both variables are numeric, which allows us to run a simple linear regression:

Code
summary(lm(data=student.survey, hi ~tv))

Call:
lm(formula = hi ~ tv, data = student.survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.441353   0.085345  40.323   <2e-16 ***
tv          -0.018305   0.008658  -2.114   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879

We can see that the slope for the explanatory variable (tv) is small and negative: each additional hour of TV watching is associated with a GPA about 0.018 points lower. The effect is statistically significant at the 0.05 level (p = 0.039) but not at the 0.01 level, so the conclusion depends on the alpha level we select. R-squared, the proportion of variance in GPA explained by TV watching time, is very low (0.072). Taking this into account, I conclude that TV watching has at most a weak negative association with high school GPA.
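
To see how borderline this effect is, we could also look at a confidence interval for the tv slope (a minimal sketch; fit_tv is a placeholder name):

Code
# Refit the model and extract a 95% confidence interval for the tv coefficient
fit_tv <- lm(hi ~ tv, data = student.survey)
confint(fit_tv, "tv", level = 0.95)  # an interval close to zero reflects the weak evidence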

To visualize this relationship, we can build a scatterplot with a fitted regression line:

Code
ggplot (data=student.survey, mapping=aes(x=tv, y=hi))+
  geom_point()+
  stat_smooth(method = "lm", se = T)+
  labs(title="Sctterplot between hours of TV watching and high school GPA", y="High school GPA", x="hours of TV watching")
`geom_smooth()` using formula = 'y ~ x'

From the graph above we can see that the standard-error band widens toward the right side of the chart, because there are few observations with many hours of TV watching. This adds uncertainty to the conclusion and calls for more data and further investigation. Since the overall slope is negative, I would suggest that other factors may also contribute to the relationship between TV watching and GPA.