hw3
Author

Young Soo Choi

Published

April 11, 2023

Question 1

Code
# data loading
library(alr4)
Warning: package 'alr4' was built under R version 4.2.3
Loading required package: car
Loading required package: carData
Loading required package: effects
Warning: package 'effects' was built under R version 4.2.3
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
data(UN11)

# check data
head(UN11)
                region  group fertility   ppgdp lifeExpF pctUrban
Afghanistan       Asia  other     5.968   499.0    49.49       23
Albania         Europe  other     1.525  3677.2    80.40       53
Algeria         Africa africa     2.142  4473.0    75.00       67
Angola          Africa africa     5.135  4321.9    53.17       59
Anguilla     Caribbean  other     2.000 13750.1    81.10      100
Argentina   Latin Amer  other     2.172  9162.1    79.89       93
Code
dim(UN11)
[1] 199   6

(a)

The stated research question concerns the effect of per-person GDP on the birth rate. Accordingly, the predictor is ‘ppgdp’ and the response is ‘fertility’.

(b)

First, a simple regression model of fertility on ppgdp was fitted.

Code
summary(lm(UN11$fertility~UN11$ppgdp))

Call:
lm(formula = UN11$fertility ~ UN11$ppgdp)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9006 -0.8801 -0.3547  0.6749  3.7585 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.178e+00  1.048e-01  30.331  < 2e-16 ***
UN11$ppgdp  -3.201e-05  4.655e-06  -6.877  7.9e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.206 on 197 degrees of freedom
Multiple R-squared:  0.1936,    Adjusted R-squared:  0.1895 
F-statistic: 47.29 on 1 and 197 DF,  p-value: 7.903e-11

A regression model of the form Fertility = 3.178 - 0.000032*ppgdp was obtained. In other words, each one-dollar increase in ppgdp is associated with a decrease of about 0.000032 in the fertility rate.
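Because the per-dollar slope is tiny, it is easier to interpret over a larger GDP difference. A minimal sketch, with the slope value copied from the summary output above:

```r
b1 <- -3.201e-05   # fitted slope taken from the summary() output above
b1 * 10000         # predicted change in fertility for a $10,000 rise in ppgdp: about -0.32
```

So a country with $10,000 more per-person GDP is predicted to have roughly 0.32 fewer children per woman under this model.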

This relationship is shown in the scatterplot below.

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ dplyr::recode() masks car::recode()
✖ purrr::some()   masks car::some()
Code
options(scipen=999)
plot(UN11$ppgdp, UN11$fertility, 
     xlab="ppGDP",
     ylab="Fertility",
     col="cornflowerblue")
abline(lm(UN11$fertility~UN11$ppgdp), col="blue")
text(mean(UN11$ppgdp)+30000, mean(UN11$fertility)+0.5, "Fertility = 3.178-0.00003*ppGDP", col = "blue")

At first glance the fitted line looks reasonable, since the points trend downward to the right; however, it is also clear that the distribution of ppgdp is strongly right-skewed.

Code
mean(UN11$ppgdp)
[1] 13011.95
Code
median(UN11$ppgdp)
[1] 4684.5

This is also visible in the comparison of the mean and the median: the mean of ppgdp is 13011.95 while the median is only 4684.5, confirming that the distribution is right-skewed. A simple linear regression model that does not transform the variables will therefore be limited in how much of the variability in fertility it can explain.

(c)

First, new variables containing the logarithm of each original variable are created.

Code
UN11$log_ppgdp<-log(UN11$ppgdp)
UN11$log_fertility<-log(UN11$fertility)

Using these variables, a regression model was fitted and a scatterplot drawn in the same way as in part (b).

Code
summary(lm(UN11$log_fertility~UN11$log_ppgdp))

Call:
lm(formula = UN11$log_fertility ~ UN11$log_ppgdp)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.79828 -0.21639  0.02669  0.23424  0.95596 

Coefficients:
               Estimate Std. Error t value            Pr(>|t|)    
(Intercept)     2.66551    0.12057   22.11 <0.0000000000000002 ***
UN11$log_ppgdp -0.20715    0.01401  -14.79 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3071 on 197 degrees of freedom
Multiple R-squared:  0.526, Adjusted R-squared:  0.5236 
F-statistic: 218.6 on 1 and 197 DF,  p-value: < 0.00000000000000022
Code
plot(UN11$log_ppgdp, UN11$log_fertility, 
     xlab="Log(ppGDP)",
     ylab="Log(Fertility)",
     col="chartreuse4")
abline(lm(UN11$log_fertility~UN11$log_ppgdp), col="darkgreen")
text(mean(UN11$log_ppgdp+1.2), mean(UN11$log_fertility+0.8), "Log(Fertility) = 2.666-0.2072*Log(ppGDP)", col = "darkgreen")

The resulting regression line fits the data much better than the untransformed fit, and the R-squared value rises from about 0.19 to 0.53. In other words, taking logarithms of both variables improves the model.

Code
# change base of logarithms to 10
UN11$log10_ppgdp<-log(UN11$ppgdp, base=10)
UN11$log10_fertility<-log(UN11$fertility, base=10)

summary(lm(UN11$log10_fertility~UN11$log10_ppgdp))

Call:
lm(formula = UN11$log10_fertility ~ UN11$log10_ppgdp)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.34669 -0.09398  0.01159  0.10173  0.41517 

Coefficients:
                 Estimate Std. Error t value            Pr(>|t|)    
(Intercept)       1.15762    0.05236   22.11 <0.0000000000000002 ***
UN11$log10_ppgdp -0.20715    0.01401  -14.79 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1334 on 197 degrees of freedom
Multiple R-squared:  0.526, Adjusted R-squared:  0.5236 
F-statistic: 218.6 on 1 and 197 DF,  p-value: < 0.00000000000000022
Code
plot(UN11$log10_ppgdp, UN11$log10_fertility, 
     xlab="Log10(ppGDP)",
     ylab="Log10(Fertility)",
     col="gray")
abline(lm(UN11$log10_fertility~UN11$log10_ppgdp), col="black")
text(mean(UN11$log10_ppgdp), mean(UN11$log10_fertility+0.07), "Log10(Fertility) = 1.158-0.2072*Log10(ppGDP)", col = "black")

Changing the base of the logarithm to 10 leaves the distribution of the data and the fitted relationship unchanged: the slope, t-statistics, and R-squared are identical, and only the axis scales and the intercept are rescaled.
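This base-change invariance can be checked on simulated data (the variables below are hypothetical, not the UN11 columns): when both axes are logged, any base gives the same slope and R-squared, and the intercept is simply divided by ln(10).

```r
set.seed(42)
x <- exp(rnorm(200, 8, 1.5))                     # hypothetical ppgdp-like values
y <- exp(1 - 0.2 * log(x) + rnorm(200, 0, 0.3))  # hypothetical fertility-like values
fe  <- lm(log(y)   ~ log(x))                     # natural log fit
f10 <- lm(log10(y) ~ log10(x))                   # base-10 fit
all.equal(coef(fe)[[2]], coef(f10)[[2]])                  # slope is identical
all.equal(summary(fe)$r.squared, summary(f10)$r.squared)  # so is R-squared
all.equal(coef(fe)[[1]] / log(10), coef(f10)[[1]])        # intercept is rescaled by 1/ln(10)
```

This is why 2.66551/ln(10) = 1.15762 appears as the new intercept while the slope -0.20715 is unchanged.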

Question 2

Code
UN11$pound_ppgdp<-UN11$ppgdp/1.33

(a)

Code
summary(lm(UN11$fertility~UN11$pound_ppgdp))

Call:
lm(formula = UN11$fertility ~ UN11$pound_ppgdp)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9006 -0.8801 -0.3547  0.6749  3.7585 

Coefficients:
                     Estimate   Std. Error t value             Pr(>|t|)    
(Intercept)       3.177911642  0.104772778  30.331 < 0.0000000000000002 ***
UN11$pound_ppgdp -0.000042575  0.000006191  -6.877       0.000000000079 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.206 on 197 degrees of freedom
Multiple R-squared:  0.1936,    Adjusted R-squared:  0.1895 
F-statistic: 47.29 on 1 and 197 DF,  p-value: 0.00000000007903
Code
plot(UN11$pound_ppgdp, UN11$fertility, 
     xlab="ppGDP(£)",
     ylab="Fertility",
     col="cornflowerblue")
abline(lm(UN11$fertility~UN11$pound_ppgdp), col="blue")
text(mean(UN11$pound_ppgdp)+30000, mean(UN11$fertility)+0.5, "Fertility = 3.178-0.00004*ppGDP(£)", col = "blue")

When dollars are converted into pounds, the slope of the simple regression line changes from about -0.000032 to about -0.000043 (the dollar slope multiplied by 1.33), but the intercept does not change.

(b)

Even if the monetary unit is changed, the R-squared value remains unchanged (0.1936 in both cases). That is, the proportion of the variation in fertility explained by ppgdp does not change. The change in slope in part (a) is simply caused by the unit conversion of the x variable: dividing the new slope by the exchange rate of 1.33 recovers the slope computed in dollars.

Code
0.000042575/1.33
[1] 0.00003201128
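The unit-conversion algebra can also be verified on simulated data (x and y below are hypothetical, not the UN11 columns): dividing the predictor by a constant multiplies the slope by that constant and leaves the intercept and fit unchanged.

```r
set.seed(1)
x <- runif(100, 0, 100)                 # hypothetical "dollar" predictor
y <- 3 - 0.05 * x + rnorm(100)
b_usd <- coef(lm(y ~ x))
b_gbp <- coef(lm(y ~ I(x / 1.33)))      # same conversion used for pound_ppgdp
all.equal(b_gbp[[2]], b_usd[[2]] * 1.33)   # slope is multiplied by 1.33
all.equal(b_gbp[[1]], b_usd[[1]])          # intercept is unchanged
```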

Question 3

Code
data(water)
head(water)
  Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594
Code
dim(water)
[1] 43  8
Code
summary(lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE,water))

Call:
lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + 
    OPSLAKE, data = water)

Residuals:
   Min     1Q Median     3Q    Max 
-12690  -4936  -1424   4173  18542 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15944.67    4099.80   3.889 0.000416 ***
APMAM         -12.77     708.89  -0.018 0.985725    
APSAB        -664.41    1522.89  -0.436 0.665237    
APSLAKE      2270.68    1341.29   1.693 0.099112 .  
OPBPC          69.70     461.69   0.151 0.880839    
OPRC         1916.45     641.36   2.988 0.005031 ** 
OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7557 on 36 degrees of freedom
Multiple R-squared:  0.9248,    Adjusted R-squared:  0.9123 
F-statistic: 73.82 on 6 and 36 DF,  p-value: < 0.00000000000000022
Code
pairs(water[,2:8])

Code
cor(water[,2:8])
            APMAM      APSAB    APSLAKE      OPBPC      OPRC    OPSLAKE
APMAM   1.0000000 0.82768637 0.81607595 0.12238567 0.1544155 0.10754212
APSAB   0.8276864 1.00000000 0.90030474 0.03954211 0.1056396 0.02961175
APSLAKE 0.8160760 0.90030474 1.00000000 0.09344773 0.1063836 0.10058669
OPBPC   0.1223857 0.03954211 0.09344773 1.00000000 0.8647073 0.94334741
OPRC    0.1544155 0.10563959 0.10638359 0.86470733 1.0000000 0.91914467
OPSLAKE 0.1075421 0.02961175 0.10058669 0.94334741 0.9191447 1.00000000
BSAAM   0.2385695 0.18329499 0.24934094 0.88574778 0.9196270 0.93843604
            BSAAM
APMAM   0.2385695
APSAB   0.1832950
APSLAKE 0.2493409
OPBPC   0.8857478
OPRC    0.9196270
OPSLAKE 0.9384360
BSAAM   1.0000000

A multiple regression model was fitted to relate stream runoff volume (BSAAM) to the annual precipitation at the six sites presented in the problem.

The six precipitation variables explain 92.5% of the variation in runoff (R-squared = 0.9248). However, R-squared necessarily increases as predictors are added, so this figure alone can overstate the model; the adjusted R-squared (0.9123) partly corrects for this.

The correlation matrix also shows two clusters of highly correlated predictors: the AP sites with each other and the OP sites with each other. Such multicollinearity inflates the standard errors of the coefficients and makes the individual estimates unstable, which helps explain why only OPRC and OPSLAKE are individually significant.

In addition, a small number of extreme observations could have a disproportionate influence on the fit.

Overall, stream runoff volume can be predicted quite well from these precipitation measurements, but the relationships among the predictors and the possible presence of extreme values should be considered when building the prediction model.
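The multicollinearity concern can be quantified with variance inflation factors. Since the car package is already loaded (as a dependency of alr4), `car::vif()` could be applied directly to the fitted water model; the sketch below instead computes a VIF by hand on simulated data (x1 and x2 are hypothetical, not the water variables) to show what the statistic measures.

```r
# A VIF is 1/(1 - R^2_j), where R^2_j comes from regressing one predictor on the others.
set.seed(7)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 0.2)          # deliberately collinear with x1
r2_j <- summary(lm(x1 ~ x2))$r.squared  # R^2 of x1 regressed on the other predictor
1 / (1 - r2_j)                          # VIF of x1; values above about 10 signal severe collinearity
```

On the water data, something like `car::vif(lm(BSAAM ~ ., data = water[,-1]))` would report these for all six sites at once.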

Question 4

Code
data("Rateprof")
head(Rateprof)
  gender numYears numRaters numCourses pepper discipline              dept
1   male        7        11          5     no        Hum           English
2   male        6        11          5     no        Hum Religious Studies
3   male       10        43          2     no        Hum               Art
4   male       11        24          5     no        Hum           English
5   male       11        19          7     no        Hum           Spanish
6   male       10        15          9     no        Hum           Spanish
   quality helpfulness  clarity easiness raterInterest sdQuality sdHelpfulness
1 4.636364    4.636364 4.636364 4.818182      3.545455 0.5518564     0.6741999
2 4.318182    4.545455 4.090909 4.363636      4.000000 0.9020179     0.9341987
3 4.790698    4.720930 4.860465 4.604651      3.432432 0.4529343     0.6663898
4 4.250000    4.458333 4.041667 2.791667      3.181818 0.9325048     0.9315329
5 4.684211    4.684211 4.684211 4.473684      4.214286 0.6500112     0.8200699
6 4.233333    4.266667 4.200000 4.533333      3.916667 0.8632717     1.0327956
  sdClarity sdEasiness sdRaterInterest
1 0.5045250  0.4045199       1.1281521
2 0.9438798  0.5045250       1.0744356
3 0.4129681  0.5407021       1.2369438
4 0.9990938  0.5882300       1.3322506
5 0.5823927  0.6117753       0.9749613
6 0.7745967  0.6399405       0.6685579
Code
dim(Rateprof)
[1] 366  17
Code
pairs(Rateprof[,8:12])

Code
cor(Rateprof[,8:12])
                quality helpfulness   clarity  easiness raterInterest
quality       1.0000000   0.9810314 0.9759608 0.5651154     0.4706688
helpfulness   0.9810314   1.0000000 0.9208070 0.5635184     0.4630321
clarity       0.9759608   0.9208070 1.0000000 0.5358884     0.4611408
easiness      0.5651154   0.5635184 0.5358884 1.0000000     0.2052237
raterInterest 0.4706688   0.4630321 0.4611408 0.2052237     1.0000000

First, the three variables quality, helpfulness, and clarity are highly correlated with one another and show a clear linear relationship (all pairwise correlation coefficients are 0.9 or higher).

Next, easiness shows a moderate positive correlation with those three variables (coefficients around 0.55), but its correlation with raterInterest is much weaker (about 0.21).

Finally, raterInterest has positive correlations of about 0.46 with quality, helpfulness, and clarity, and only about 0.21 with easiness.

In summary, since quality, helpfulness, and clarity are so strongly correlated with each other, including all three adds little independent information, and excluding some of them should be considered in any subsequent statistical analysis.

Question 5

Code
library(smss)
Warning: package 'smss' was built under R version 4.2.3
Code
data(student.survey)

head(student.survey)
  subj ge ag  hi  co   dh   dr tv sp ne ah    ve pa           pi           re
1    1  m 32 2.2 3.5    0  5.0  3  5  0  0 FALSE  r conservative   most weeks
2    2  f 23 2.1 3.5 1200  0.3 15  7  5  6 FALSE  d      liberal occasionally
3    3  f 27 3.3 3.0 1300  1.5  0  4  3  0 FALSE  d      liberal   most weeks
4    4  f 35 3.5 3.2 1500  8.0  5  5  6  3 FALSE  i     moderate occasionally
5    5  m 23 3.1 3.5 1600 10.0  6  6  3  0 FALSE  i very liberal        never
6    6  m 39 3.5 3.5  350  3.0  4  5  7  0 FALSE  d      liberal occasionally
     ab    aa    ld
1 FALSE FALSE FALSE
2 FALSE FALSE    NA
3 FALSE FALSE    NA
4 FALSE FALSE FALSE
5 FALSE FALSE FALSE
6 FALSE FALSE    NA
Code
dim(student.survey)
[1] 60 18

(i-a)

Looking at the data first, the two variables of interest, pi and re, are categorical variables with ordered levels.

Code
table(student.survey$pi)

         very liberal               liberal      slightly liberal 
                    8                    24                     6 
             moderate slightly conservative          conservative 
                   10                     6                     4 
    very conservative 
                    2 
Code
table(student.survey$re)

       never occasionally   most weeks   every week 
          15           29            7            9 

Regression methods designed for categorical responses, such as ordinal logistic regression, could be used here. Instead, the analysis below simply assigns a number to each level and applies ordinary regression. Both variables are ordinal: for religiosity, a higher number means more frequent participation in religious activities, and for political ideology, a higher number means more conservative.

Code
# transform variables
student.survey$pol_id<-
  ifelse(student.survey[,"pi"]=="very liberal", 1,
       ifelse(student.survey[,"pi"]=="liberal", 2,
              ifelse(student.survey[,"pi"]=="slightly liberal", 3,
                     ifelse(student.survey[,"pi"]=="moderate", 4,
                            ifelse(student.survey[,"pi"]=="slightly conservative", 5,
                                   ifelse(student.survey[,"pi"]=="conservative", 6, 7))))))


student.survey$rel_fre<-
  ifelse(student.survey[,"re"]=="never",1,
         ifelse(student.survey[,"re"]=="occasionally", 2,
                ifelse(student.survey[,"re"]=="most weeks", 3, 4)))
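Since pi and re are already factors whose levels are stored in the order shown in the tables above, the nested ifelse() calls could be replaced by as.numeric(), which returns each level's position. A small self-contained illustration with hypothetical values (assuming the level ordering matches the tables):

```r
f <- factor(c("occasionally", "never", "every week", "most weeks"),
            levels = c("never", "occasionally", "most weeks", "every week"))
as.numeric(f)   # 2 1 4 3 -- the same 1-4 coding produced by the nested ifelse
```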
Code
# regression model
summary(lm(student.survey$pol_id~student.survey$rel_fre))

Call:
lm(formula = student.survey$pol_id ~ student.survey$rel_fre)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81243 -0.87160  0.09882  1.12840  3.09882 

Coefficients:
                       Estimate Std. Error t value   Pr(>|t|)    
(Intercept)              0.9308     0.4252   2.189     0.0327 *  
student.survey$rel_fre   0.9704     0.1792   5.416 0.00000122 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared:  0.3359,    Adjusted R-squared:  0.3244 
F-statistic: 29.34 on 1 and 58 DF,  p-value: 0.000001221
Code
# plot
plot(student.survey$rel_fre, student.survey$pol_id, xaxt = 'n', yaxt='n',
     xlab="Religiosity",
     ylab="Political Ideology",
     col="cornflowerblue")
axis(1, at = seq(1, 4, by = 1), labels = c("never", "occasionally", "most weeks", "every week"))
axis(2, at = seq(1,7, by=1), labels = c("very liberal", "liberal", "slightly liberal", "moderate", "slightly conservative", "conservative", "very conservative"))
abline(lm(student.survey$pol_id~student.survey$rel_fre), col="blue")

(i-b)

Religiosity shows a positive association with political ideology. In other words, the more often students participate in religious activities, the more conservative their political ideology tends to be, while those who rarely participate tend to be more liberal.

(ii-a)

Code
# regression model
summary(lm(student.survey$hi~student.survey$tv))

Call:
lm(formula = student.survey$hi ~ student.survey$tv)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
                   Estimate Std. Error t value            Pr(>|t|)    
(Intercept)        3.441353   0.085345  40.323 <0.0000000000000002 ***
student.survey$tv -0.018305   0.008658  -2.114              0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879
Code
# plot
plot(student.survey$tv, student.survey$hi,
     xlab="hours of TV watching",
     ylab="high school GPA",
     col="cornflowerblue")
abline(lm(student.survey$hi~student.survey$tv), col="blue")

(ii-b)

As shown in the figure, there is a weak negative association, but the effect is small (R-squared = 0.0716). There may also be a lurking variable that distorts the size or direction of the apparent effect of TV watching on GPA, so the relationship should be analyzed further together with other variables.