fp
research question
desriptive statistics
model
Final Project
Author

Guanhua Tan

Published

May 4, 2023

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(ggplot2)
library(dbplyr)

Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql
Code
library(stargazer)

Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
Code
library(misty)
|-------------------------------------|
| misty 0.4.9 (2023-05-02)            |
| Miscellaneous Functions T. Yanagida |
|-------------------------------------|
Code
library(reshape2)

Attaching package: 'reshape2'

The following object is masked from 'package:tidyr':

    smiths
Code
pisa <- read_csv('_data/CY07_MSU_SCH_QQQ.csv')
New names:
Rows: 21903 Columns: 198
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(8): CNT, CYC, NatCen, STRATUM, SUBNATIO, SC053D11TA, PRIVATESCH, VER_DAT dbl
(189): ...1, CNTRYID, CNTSCHID, Region, OECD, ADMINMODE, LANGTEST, SC001... lgl
(1): BOOKID
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`

Introduction

Research Questions

The Organization for Economic Co-operation and Development (OECD), founded in 1960, has conducted a constant series of surveys “Programme for International Student Assessment” (PISA) that evaluates 15-year-olds’ performance since 2000. The datasets obtained by these surveys have helped governments and education policymakers launch education reforms in an effort to “meet real-life challenges” (https://www.oecd.org/pisa/) in their countries. This paper explores the latest survey in 2018. Because the survey covers a wide range of issues regarding students’ performance and schools’ situations, my final project will narrow to investigate the relationship between career guidance and access to digital devices in schools.

Hpyotheis

I propose that the scores of access to digital devices may be the major factor to career guidance. Also, I propose that school types and urban settings may be cofunders.

  1. the scores of access to digital devices may be the major factor in career guidance
  2. OECD, school types, and urban settings may be co-founders.
Code
# tidy data
# create a data frame
#view(pisa)
# select related variable
pisa_selected <- dplyr::select(pisa, starts_with(c("SC161","SC155", "SC001Q01TA", "SC013Q01TA")))
pisa2018_joint <-cbind(pisa[, 1:12], pisa_selected)
# pisa_SC155
pisa2018_joint$Digitals=rowMeans(pisa2018_joint[,c("SC155Q01HA","SC155Q02HA",                                                  "SC155Q03HA","SC155Q04HA","SC155Q05HA","SC155Q06HA", "SC155Q07HA","SC155Q08HA","SC155Q09HA", "SC155Q10HA", "SC155Q11HA")])
pisa2018_joint$Career_Guidance=rowSums(pisa2018_joint[, c("SC161Q02SA","SC161Q03SA","SC161Q04SA","SC161Q04SA")])
pisa_SC155 <- pisa2018_joint %>%
  dplyr::select(CNT, STRATUM, OECD, Career_Guidance,Digitals, SC001Q01TA, SC013Q01TA) %>%
  mutate(Urban=SC001Q01TA, Public_or_Private=SC013Q01TA) %>%
  dplyr::select(-c(SC001Q01TA, SC013Q01TA)) %>%
  dplyr::select(c(CNT,STRATUM,OECD,Urban, Public_or_Private,Career_Guidance,Digitals))
pisa_SC155

Descriptive Statistics

This original OECD PISA 2018 School Questionnaire Dataset is one part of PISA 2018 dataset with a focus on schools. It covers 80 countries and regions all over the world. The dataset documents 21,903 schools’ responses regarding 187 questions. After cleaning the data, the dataset includes 7 variables: “CNT” identifies countries. “STRATUM” identifies schools. “OECD” indicates if a school locates in an OECD country or not. “Urban” describes different conditions of urban communities where a school locates. “Public_or_Private” presents if a school is public or private. “Career_Guidance” demonstrates scores the schools self-report in terms of career guidance. “Digitals” reflects schools’ access to digital devices.

Code
# summarizing and visualizing data
summary(pisa_SC155)
     CNT              STRATUM               OECD            Urban      
 Length:21903       Length:21903       Min.   :0.0000   Min.   :1.000  
 Class :character   Class :character   1st Qu.:0.0000   1st Qu.:2.000  
 Mode  :character   Mode  :character   Median :1.0000   Median :3.000  
                                       Mean   :0.5171   Mean   :3.007  
                                       3rd Qu.:1.0000   3rd Qu.:4.000  
                                       Max.   :1.0000   Max.   :5.000  
                                                        NA's   :1363   
 Public_or_Private Career_Guidance    Digitals    
 Min.   :1.00      Min.   :0.000   Min.   :1.000  
 1st Qu.:1.00      1st Qu.:1.000   1st Qu.:2.273  
 Median :1.00      Median :1.000   Median :2.636  
 Mean   :1.19      Mean   :1.518   Mean   :2.664  
 3rd Qu.:1.00      3rd Qu.:2.000   3rd Qu.:3.000  
 Max.   :2.00      Max.   :4.000   Max.   :4.000  
 NA's   :2092      NA's   :1499    NA's   :1399   
Code
pisa_SC155_boxplot<-pisa_SC155 %>%
  select(STRATUM, Career_Guidance, Digitals) %>% 
  pivot_longer(cols=c(Career_Guidance, Digitals), 
               names_to = "Group", values_to = "Evaluation")

ggplot(pisa_SC155_boxplot,aes(Evaluation, fill=Group))+
  stat_boxplot(geom = "errorbar", # Error bars
               width = 0.2)+
  geom_boxplot()+
  facet_wrap(~Group)+
  labs(title="Pisa2018 Evaluation")+
  coord_flip()
Warning: Removed 2898 rows containing non-finite values (`stat_boxplot()`).
Removed 2898 rows containing non-finite values (`stat_boxplot()`).

After using the summary function and visualization, I have already show the descriptive statistics. A large number of NA stands out. Now I will figure out how to deal with them properly.

Analysis

Discussion on NA

Code
# missing data
library(naniar)
pisa_SC115_nonOECD <- filter(pisa_SC155, OECD == "0")
pisa_SC115_OECD <- filter(pisa_SC155, OECD == "1")
vis_miss(pisa_SC155)

Code
vis_miss(pisa_SC115_nonOECD)

Code
vis_miss(pisa_SC115_OECD)

The graphics have disclose that compared with NON-OECD countries, OECD countries missed more data in terms of school type and career guidance. I believe that missing data are not caused by the poor technological condition in non-OECD countries. So missing data is random and can be dropped.

Model Evaluation on Linear Regression and Logistic Regression

Code
pisa_SC155_NoNA <- drop_na(pisa_SC155)
#summary(pisa_SC155_NoNA)

model_1 <-lm(Career_Guidance~Urban+Public_or_Private+Digitals+Urban*Digitals, data = pisa_SC155_NoNA)
#summary(model_1)

model_2 <-lm(Career_Guidance~Urban+Public_or_Private+Public_or_Private*Digitals, data = pisa_SC155_NoNA)
#summary(model_2)

model_3 <-lm(Career_Guidance~Urban+Public_or_Private+Digitals, data = pisa_SC155_NoNA)
#summary(model_3)

model_4 <-lm(Career_Guidance~Urban+Public_or_Private+OECD+Digitals, data = pisa_SC155_NoNA)

stargazer(model_1, model_2, model_3, model_4, type = 'text')

======================================================================================================================================
                                                                       Dependent variable:                                            
                           -----------------------------------------------------------------------------------------------------------
                                                                         Career_Guidance                                              
                                      (1)                        (2)                        (3)                        (4)            
--------------------------------------------------------------------------------------------------------------------------------------
Urban                                0.006                     0.047***                   0.047***                   0.045***         
                                    (0.025)                    (0.006)                    (0.006)                    (0.006)          
                                                                                                                                      
Public_or_Private                   0.134***                   0.297***                   0.137***                   0.140***         
                                    (0.019)                    (0.091)                    (0.019)                    (0.019)          
                                                                                                                                      
OECD                                                                                                                 0.262***         
                                                                                                                     (0.014)          
                                                                                                                                      
Digitals                            0.162***                   0.276***                   0.210***                   0.198***         
                                    (0.030)                    (0.039)                    (0.012)                    (0.012)          
                                                                                                                                      
Urban:Digitals                       0.016*                                                                                           
                                    (0.009)                                                                                           
                                                                                                                                      
Public_or_Private:Digitals                                     -0.056*                                                                
                                                               (0.031)                                                                
                                                                                                                                      
Constant                            0.788***                   0.473***                   0.660***                   0.559***         
                                    (0.083)                    (0.110)                    (0.037)                    (0.037)          
                                                                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------
Observations                         18,563                     18,563                     18,563                     18,563          
R2                                   0.030                      0.030                      0.030                      0.047           
Adjusted R2                          0.030                      0.030                      0.029                      0.047           
Residual Std. Error            0.975 (df = 18558)         0.975 (df = 18558)         0.975 (df = 18559)         0.966 (df = 18558)    
F Statistic                142.347*** (df = 4; 18558) 142.420*** (df = 4; 18558) 188.786*** (df = 3; 18559) 229.446*** (df = 4; 18558)
======================================================================================================================================
Note:                                                                                                      *p<0.1; **p<0.05; ***p<0.01

After the linear regression model comparison, model_4 presents statistical significance on all independent variables. The career guidance depends on urban situations, school styles and the access to digital devices.

Code
# diagnostics
par(mfrow=c(2,3))
plot(model_4, which= 1:6)

Code
# logistic regression and AICs

pisa_SC155_NoNA$Career_Guidance_Ordinal <- as.factor(pisa_SC155_NoNA$Career_Guidance)
class(pisa_SC155_NoNA$Career_Guidance_Ordinal)
[1] "factor"
Code
table(pisa_SC155_NoNA$Career_Guidance_Ordinal)

   0    1    2    3    4 
2308 7982 5296 2231  746 
Code
fit2 <- MASS::polr(Career_Guidance_Ordinal~Urban+Public_or_Private+OECD+Digitals, data = pisa_SC155_NoNA, Hess=T)
summary(fit2)
Call:
MASS::polr(formula = Career_Guidance_Ordinal ~ Urban + Public_or_Private + 
    OECD + Digitals, data = pisa_SC155_NoNA, Hess = T)

Coefficients:
                    Value Std. Error t value
Urban             0.08192    0.01091   7.506
Public_or_Private 0.27837    0.03575   7.787
OECD              0.58924    0.02746  21.460
Digitals          0.39795    0.02327  17.098

Intercepts:
    Value   Std. Error t value
0|1 -0.0987  0.0711    -1.3881
1|2  2.1679  0.0726    29.8705
2|3  3.6658  0.0757    48.4009
3|4  5.1988  0.0826    62.9217

Residual Deviance: 49559.13 
AIC: 49575.13 
Code
broom::glance(model_4)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df  logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>
1    0.0471        0.0469 0.966      229. 1.31e-192     4 -25701. 51414. 51461.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

But model_4 still shows the relative lower R squared. In addition, the diagnostics plots show certain patterns. Both remind me of further exploring a different model. So I move to try the logistic regression model and calculate the AICs for this two models. Finally, the logistic regression model (fit2) has a lower AIC.

Interpret the Model

Code
# Interpret and Visualize the fit2
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
Code
newdat <- data.frame(
  Urban = rep(1:5, 3712),
  Public_or_Private = rep(1:2, each = 9280 ),
  OECD = rep(0:1, eache = 9280),
  Digitals = rep(1:4, 4640))

newdat <- cbind(newdat, predict(fit2, newdat, type = "probs"))
lnewdat <- melt(newdat, id.vars = c("Urban", "Public_or_Private", "OECD",  "Digitals"),
                variable.name = "Level", value.name="Probability")

ggplot(lnewdat, aes(x = Digitals, y = Probability, colour = Level))+
  geom_line() + 
  facet_grid(Urban ~ Public_or_Private~OECD , labeller="label_both")

The graphics have reflected the complexities of the relationship between “Career_Guidance” and “Digitial.” The relationship has slightly distinct patterns in non-OECD and OECD countries. In non-OECD countries, with the lower levels (0 and 1) of “Career Guidance”, the possibility remain almost unchanged as “Digitals” increases. However, in OECD countries, with the lowers of “Career Guidance”, the possibility will drop when “Digitals” goes up. With the higher levels (2 and 3), the possibility will increase when “Digitals” increases in both groups of countries. But when “Career Guidance” reaches to 4, the possibility approaches to zero because the sample is too tiny.

Conclusion

My project has reflected the complexities of the relationship between “Career_Guidance” and “Digitials.” The relationship has slightly distinct patterns in non-OECD and OECD countries.

Reference

OECD PISA 2018 School Questionnaire Dataset-https://www.kaggle.com/datasets/dilaraahan/pisa-2018-school-questionnaire?resource=download OECD PISA2018 Code Book-https://www.oecd.org/pisa/data/2018database/ PISA2018 “School Questionnaire”-https://www.oecd.org/pisa/data/2018database/CY7_201710_QST_MS_SCQ_NoNotes_final.pdf OECD PISA website-https://www.oecd.org/pisa

R Language as programming language

Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. OReilly Media.

Danielle Navarro (2023), Learning statistics with R: A tutorial for psychology students and other beginners (Version 0.6) :https://learningstatisticswithr.com/lsr-0.6.pdf

Christoph Hanck, Martin Arnold, Alexander Gerber, and Martin Schmelzer (2023). Introduction to Econometrics with R: https://www.econometrics-with-r.org/index.html