Final Project Check 2

research question
desriptive statistics
model
Final Project Check 2
Author

Guanhua Tan

Published

April 4, 2023

My final project will be a further investigation on digital devices in schools that I have submitted as the final project for DACSS 601. I still explore the data from the survey “Programme for International Student Assessment” in 2018. In this assignment, I will propose my hypothesis, and present the descriptive statistics with minor changes base on my last project.

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(ggplot2)
library(dbplyr)

Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql
Code
library(stargazer)

Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
Code
library(misty)
|-------------------------------------|
| misty 0.4.8 (2023-03-10)            |
| Miscellaneous Functions T. Yanagida |
|-------------------------------------|
Code
library(MASS)

Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select
Code
library(reshape2)

Attaching package: 'reshape2'

The following object is masked from 'package:tidyr':

    smiths
Code
pisa <- read_csv('_data/CY07_MSU_SCH_QQQ.csv')
New names:
Rows: 21903 Columns: 198
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(8): CNT, CYC, NatCen, STRATUM, SUBNATIO, SC053D11TA, PRIVATESCH, VER_DAT dbl
(189): ...1, CNTRYID, CNTSCHID, Region, OECD, ADMINMODE, LANGTEST, SC001... lgl
(1): BOOKID
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`

Research Questions

My final project will probe into what factors contribute to the accessibility to and human resources’ support for digital devices in schools. Additionally, I will explore if there is a correlations between career guidance and digital devices? I will conduct this research based on the data “Programme for International Student Assessment” (PISA) collected by the The Organization for Economic Co-operation and Development (OECD) in 2018.

Hpyotheis

I propose that the size of urban population primarily contributes to the conditions of digital device. “OECD or Non-OECD” and “public or private schools” may be two cofounders, which is suppose to be incorporated into the regression analysis. Also, I hypothesize that the higher score a school report regarding career guidance, the higher score a school reports in terms of digital divices.

Code
# create a data frame
#view(pisa)
# select related variable
pisa_selected <- select(pisa,starts_with(c("SC001", "SC013", "SC016", "SC161","SC155")))
Error in select(pisa, starts_with(c("SC001", "SC013", "SC016", "SC161", : unused argument (starts_with(c("SC001", "SC013", "SC016", "SC161", "SC155")))
Code
pisa2018_joint <-cbind(pisa[, 1:12], pisa_selected)
Error in data.frame(..., check.names = FALSE): object 'pisa_selected' not found
Code
# pisa_SC155
pisa2018_joint$Digitals=rowMeans(pisa2018_joint[,c("SC155Q01HA","SC155Q02HA",                                                  "SC155Q03HA","SC155Q04HA","SC155Q05HA","SC155Q06HA", "SC155Q07HA","SC155Q08HA","SC155Q09HA", "SC155Q10HA", "SC155Q11HA")])
Error in is.data.frame(x): object 'pisa2018_joint' not found
Code
pisa2018_joint$Career_Guidance=rowSums(pisa2018_joint[, c("SC161Q02SA","SC161Q03SA","SC161Q04SA","SC161Q04SA")])
Error in is.data.frame(x): object 'pisa2018_joint' not found
Code
pisa_SC155 <- pisa2018_joint %>%
  select(CNT, STRATUM, OECD, Career_Guidance,Digitals, SC001Q01TA, SC013Q01TA) %>%
  mutate(Urban=SC001Q01TA, Public_or_Private=SC013Q01TA) %>%
  select(-c(SC001Q01TA, SC013Q01TA)) %>%
  select(c(CNT,STRATUM,OECD,Urban, Public_or_Private,Career_Guidance,Digitals))
Error in select(., c(CNT, STRATUM, OECD, Urban, Public_or_Private, Career_Guidance, : unused argument (c(CNT, STRATUM, OECD, Urban, Public_or_Private, Career_Guidance, Digitals))
Code
pisa_SC155
Error in eval(expr, envir, enclos): object 'pisa_SC155' not found

Descriptive Statistics

This original OECD PISA 2018 School Questionnaire Dataset is one part of PISA 2018 dataset with a focus on schools. It covers 80 countries and regions all over the world. The dataset documents 21,903 schools’ responses regarding 187 questions.After cleaning the data, the dataset includes 8 variables: CNT identifies countries. STRATUM identifies schools. OECD indicates if a school locates in a OECD country or not. Urban describes different conditions of urban communities where a school locates. Public_or_Private presents if a school is public or private. Career_Guidance demonstrates the score a school reports in terms of career guidance. Accessibility demonstrates the score a school reports in terms of accessibility to digital devices. Human_Resource_Support suggests the score a school reports in terms of human ressource support for digital devices.

After using the summary function and visualization, I have already show the descriptive statistics. A large number of NA stands out. I will figure out how to deal with them properly.

Code
summary(pisa_SC155)
Error in summary(pisa_SC155): object 'pisa_SC155' not found
Code
pisa_SC155_boxplot<-pisa_SC155 %>%
  select(STRATUM, Career_Guidance, Digitals) %>% 
  pivot_longer(cols=c(Career_Guidance, Digitals), 
               names_to = "Group", values_to = "Evaluation")
Error in select(., STRATUM, Career_Guidance, Digitals): unused arguments (STRATUM, Career_Guidance, Digitals)
Code
ggplot(pisa_SC155_boxplot,aes(Evaluation, fill=Group))+
  stat_boxplot(geom = "errorbar", # Error bars
               width = 0.2)+
  geom_boxplot()+
  facet_wrap(~Group)+
  labs(title="Pisa2018 Evaluation")+
  coord_flip()
Error in ggplot(pisa_SC155_boxplot, aes(Evaluation, fill = Group)): object 'pisa_SC155_boxplot' not found

Analysis

Discussion on NA

Code
# missing data
library(naniar)
pisa_SC115_nonOECD <- filter(pisa_SC155, OECD == "0")
Error in filter(pisa_SC155, OECD == "0"): object 'pisa_SC155' not found
Code
pisa_SC115_OECD <- filter(pisa_SC155, OECD == "1")
Error in filter(pisa_SC155, OECD == "1"): object 'pisa_SC155' not found
Code
vis_miss(pisa_SC155)
Error in test_if_dataframe(x): object 'pisa_SC155' not found
Code
vis_miss(pisa_SC115_nonOECD)
Error in test_if_dataframe(x): object 'pisa_SC115_nonOECD' not found
Code
vis_miss(pisa_SC115_OECD)
Error in test_if_dataframe(x): object 'pisa_SC115_OECD' not found

The graphics have disclose that compared with NON-OECD countries, OECD countries missed more data in terms of school type and career guidance. I believe that missing data are not caused by the poor technological condition in non-OECD countries. So missing data is random.

model evaluation

Code
pisa_SC155_NoNA <- drop_na(pisa_SC155)
Error in drop_na(pisa_SC155): object 'pisa_SC155' not found
Code
#summary(pisa_SC155_NoNA)

model_1 <-lm(Career_Guidance~Urban+Public_or_Private+Digitals+Urban*Digitals, data = pisa_SC155_NoNA)
Error in is.data.frame(data): object 'pisa_SC155_NoNA' not found
Code
#summary(model_1)

model_2 <-lm(Career_Guidance~Urban+Public_or_Private+Public_or_Private*Digitals, data = pisa_SC155_NoNA)
Error in is.data.frame(data): object 'pisa_SC155_NoNA' not found
Code
#summary(model_2)

model_3 <-lm(Career_Guidance~Urban+Public_or_Private+Digitals, data = pisa_SC155_NoNA)
Error in is.data.frame(data): object 'pisa_SC155_NoNA' not found
Code
#summary(model_3)


stargazer(model_1, model_2, model_3,  type = 'text')
Error in .stargazer.wrap(..., type = type, title = title, style = style, : object 'model_1' not found

After the model comparison, model_3 presents statistical significance on all independent variables. The career gudience depends on urban situations, school styles and the access to digital devices.

Code
# diagnostics
par(mfrow=c(2,3))
plot(model_3, which= 1:6)
Error in plot(model_3, which = 1:6): object 'model_3' not found
Code
# log-git model
pisa_SC155_NoNA$Career_Guidance_Ordinal <- as.factor(pisa_SC155_NoNA$Career_Guidance)
Error in is.factor(x): object 'pisa_SC155_NoNA' not found
Code
class(pisa_SC155_NoNA$Career_Guidance_Ordinal)
Error in eval(expr, envir, enclos): object 'pisa_SC155_NoNA' not found
Code
table(pisa_SC155_NoNA$Career_Guidance_Ordinal)
Error in table(pisa_SC155_NoNA$Career_Guidance_Ordinal): object 'pisa_SC155_NoNA' not found
Code
fit2<-polr(Career_Guidance_Ordinal~Urban+Public_or_Private+Digitals, data = pisa_SC155_NoNA, Hess=T)
Error in eval(expr, p): object 'pisa_SC155_NoNA' not found
Code
summary(fit2)
Error in summary(fit2): object 'fit2' not found
Code
broom::glance(model_3)
Error in broom::glance(model_3): object 'model_3' not found

But model_3 still shows the relative lower R squared. In addition, the diagnostics plots show certain patterns. Both remind me of further exploring a different model. So I plan to try log-git model and calculate the AICs for this two models. Final the log-git model (fit2) has a lower AIC.

Interpret the Model

Code
#ctable <- coef(summary(fit2))
#p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
#ctable <- cbind(ctable, "p value" = p)
#ctable
#exp(coef(fit2))

newdat <- data.frame(
  Urban = rep(1:5, 3712),
  Public_or_Private = rep(1:2, each = 9280 ),
  Digitals = rep(1:4, 4640))

newdat <- cbind(newdat, predict(fit2, newdat, type = "probs"))
Error in predict(fit2, newdat, type = "probs"): object 'fit2' not found
Code
lnewdat <- melt(newdat, id.vars = c("Urban", "Public_or_Private", "Digitals"),
                variable.name = "Level", value.name="Probability")

ggplot(lnewdat, aes(x = Digitals, y = Probability, colour = Level))+
  geom_line() + 
  facet_grid(Urban ~ Public_or_Private , labeller="label_both")
Error in `geom_line()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `FUN()`:
! object 'Probability' not found