hw2
confidence intervals
DACSS603_Homework 2
Author

Meredith Derian-Toth

Published

April 2, 2023

library(“alr4”)

library(“smss”)

Code
library("alr4")
Loading required package: car
Loading required package: carData
Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library("smss")
library("ggplot2")
        
data(UN11)
head(UN11)
                region  group fertility   ppgdp lifeExpF pctUrban
Afghanistan       Asia  other     5.968   499.0    49.49       23
Albania         Europe  other     1.525  3677.2    80.40       53
Algeria         Africa africa     2.142  4473.0    75.00       67
Angola          Africa africa     5.135  4321.9    53.17       59
Anguilla     Caribbean  other     2.000 13750.1    81.10      100
Argentina   Latin Amer  other     2.172  9162.1    79.89       93

Question 1:

Graph ppgdp with fertility and transform the data

a) the predictor variable is the independent variable, this is per person gross national product (ppgdp). And the response is the dependent variable, this is fertility.

Code
ggplot(data = UN11, aes(x = ppgdp, y = fertility)) +
  geom_point() +
  geom_smooth(method = 'lm', se=F)
`geom_smooth()` using formula = 'y ~ x'

b) The scatterplot above shows the level of fertility based on per person gdp across different countries. What we can see is that as the gdp person increases, the level of fertility decreases, until it reaches a specific point, and then it stays pretty level.

Code
ggplot(data = UN11, aes(x = log(ppgdp), y = log(fertility))) +
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

c) Now that we have transformed this data into a logarithm, a simple linear regression does seem plausible to summarize the graph.

Question 2:

a) When the independent variable is increased by 1.33, the slope of the prediction will increase by 1.33.

b) When the independent variable is increased by 1.33, the correlation will not change. The correlation is the standardized version of the slope and does not depend on unit of measurement.

Question 3:

Can Southern California’s water supply in future years be predicted from past data?

Code
data(water)
head(water)
  Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594
Code
ggplot(data = water, aes(x = Year, y = BSAAM)) +
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Due to the shape of this relationship, it is best to transform the data.

Code
pairs(water)

What we see in the scatterplot matrix above is a relationship between locations, but not necesarily location and year (looking at the top row of scatterplots). Therefore further analysis is necessary in order to attempt a prediction of water runoff.

Question 4:

Professor Ratings Overtime

Code
data("Rateprof")
pairs(~ quality + helpfulness + clarity + easiness + raterInterest, data=Rateprof)

From the scatterplot matrix above we can see that there is a positivie relationship between professor rating and quality of professor, helpfulness of professor, and their clarity. However the professor rating is not related to the easiness of their class or the rater interest in the material.

Question 5:

Regression Analysis

Code
data("student.survey")
head(student.survey)
  subj ge ag  hi  co   dh   dr tv sp ne ah    ve pa           pi           re
1    1  m 32 2.2 3.5    0  5.0  3  5  0  0 FALSE  r conservative   most weeks
2    2  f 23 2.1 3.5 1200  0.3 15  7  5  6 FALSE  d      liberal occasionally
3    3  f 27 3.3 3.0 1300  1.5  0  4  3  0 FALSE  d      liberal   most weeks
4    4  f 35 3.5 3.2 1500  8.0  5  5  6  3 FALSE  i     moderate occasionally
5    5  m 23 3.1 3.5 1600 10.0  6  6  3  0 FALSE  i very liberal        never
6    6  m 39 3.5 3.5  350  3.0  4  5  7  0 FALSE  d      liberal occasionally
     ab    aa    ld
1 FALSE FALSE FALSE
2 FALSE FALSE    NA
3 FALSE FALSE    NA
4 FALSE FALSE FALSE
5 FALSE FALSE FALSE
6 FALSE FALSE    NA
Code
#regression analysis i

student_survey_reg_i<-lm(as.numeric(pi) ~ as.numeric(re), data = student.survey)
summary(student_survey_reg_i)

Call:
lm(formula = as.numeric(pi) ~ as.numeric(re), data = student.survey)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81243 -0.87160  0.09882  1.12840  3.09882 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.9308     0.4252   2.189   0.0327 *  
as.numeric(re)   0.9704     0.1792   5.416 1.22e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared:  0.3359,    Adjusted R-squared:  0.3244 
F-statistic: 29.34 on 1 and 58 DF,  p-value: 1.221e-06
Code
ggplot(student.survey, aes(re, fill = pi)) +
  geom_bar(position = "fill")

Above we can see the relationship between religiosity and political ideology. We see that the students who attend church more regularly also identify more as conservatives than liberals. According to the regression analysis, the relationship between political ideology and religiosity is significant with a p value very close to 0. From this we can see that as religiosity increases, political ideology leans conservative.

Code
#regression analysis (ii)

student_survey_reg_ii <- lm(hi ~ tv, data = student.survey)
summary(student_survey_reg_ii)

Call:
lm(formula = hi ~ tv, data = student.survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.441353   0.085345  40.323   <2e-16 ***
tv          -0.018305   0.008658  -2.114   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879
Code
ggplot(data = student.survey, aes(x = tv, y = hi)) +
  geom_point() +
  geom_smooth(method = 'lm', se=F)
`geom_smooth()` using formula = 'y ~ x'

Above we can see the relationship between average rate of tv watching per week and high school GPA. We see that the students who watch more TV on average have a lower GPS. According to the regression analysis, the relationship between average hours of tv watched per week and high school GPS is significant with a p value very close to 0.01. The adjusted R squared is quite low and the p value is not at a strong significance level. We can see that the relationship in the regression analysis above is stronger.