Final Project Check 1

fpc1

research question

descriptive statistics

Author

Guanhua Tan

Published

March 4, 2023

My final project extends the investigation of digital devices in schools that I submitted as my final project for DACSS 601. I continue to use data from the 2018 “Programme for International Student Assessment” (PISA) survey. In this assignment, I propose my hypotheses and present the descriptive statistics, with minor changes based on my previous project.

Code

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Code

library(ggplot2)
library(dbplyr)


Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

Code

pisa <- read_csv('_data/CY07_MSU_SCH_QQQ.csv')

New names:
• `` -> `...1`
Rows: 21903 Columns: 198
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (8): CNT, CYC, NatCen, STRATUM, SUBNATIO, SC053D11TA, PRIVATESCH, VER_DAT
dbl (189): ...1, CNTRYID, CNTSCHID, Region, OECD, ADMINMODE, LANGTEST, SC001...
lgl   (1): BOOKID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Research Questions

My final project examines which factors contribute to the accessibility of digital devices in schools and to the human resource support available for them. I will also explore whether there is a correlation between career guidance and digital devices. The research draws on the 2018 “Programme for International Student Assessment” (PISA) data collected by the Organisation for Economic Co-operation and Development (OECD).

Hypothesis

I propose that the size of the urban community in which a school is located is the primary contributor to the condition of its digital devices. “OECD or non-OECD” and “public or private school” may be two confounders, which should be incorporated into the regression analysis. I also hypothesize that the higher the score a school reports for career guidance, the higher the score it reports for digital devices.
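The hypotheses above could be tested along the following lines. This is only a sketch on simulated toy data, not the PISA file: the data frame `toy`, its values, and the effect sizes are invented for illustration. Treating `Urban` as a numeric predictor, coding `Public_or_Private` as a factor, and using a Spearman correlation are assumptions rather than decisions made in this project.

```r
# Toy data only: every value here is simulated, none comes from PISA.
set.seed(1)
n <- 200
toy <- data.frame(
  Urban = sample(1:5, n, replace = TRUE),              # community size category (1-5)
  OECD = sample(0:1, n, replace = TRUE),               # 0/1 indicator
  Public_or_Private = sample(1:2, n, replace = TRUE),  # coded 1/2 as in the summary
  Career_Guidance = sample(0:4, n, replace = TRUE)
)
# Invented relationship so the model has something to estimate
toy$Accessibility <- 1.5 + 0.2 * toy$Urban + 0.3 * toy$OECD + rnorm(n, sd = 0.5)

# Accessibility regressed on Urban, adjusting for the two proposed confounders
fit <- lm(Accessibility ~ Urban + OECD + factor(Public_or_Private), data = toy)
summary(fit)

# Rank correlation between career guidance and accessibility
ct <- cor.test(toy$Career_Guidance, toy$Accessibility,
               method = "spearman", exact = FALSE)
ct$estimate
```

The same formula could be refit with `Human_Resource_Support` as the outcome to cover the second dependent variable.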

Code

# Select the questionnaire items used in this analysis
# view(pisa)
pisa_selected <- select(pisa, starts_with(c("SC001", "SC013", "SC016", "SC161", "SC155")))
# Combine the first 12 columns of the raw data with the selected items
pisa2018_joint <- cbind(pisa[, 1:12], pisa_selected)

# Accessibility: mean of SC155 items 1-4
pisa2018_joint$Accessibility <- rowMeans(
  pisa2018_joint[, c("SC155Q01HA", "SC155Q02HA", "SC155Q03HA", "SC155Q04HA")])
# Human resource support: mean of SC155 items 5-11
pisa2018_joint$Human_Resource_Support <- rowMeans(
  pisa2018_joint[, c("SC155Q05HA", "SC155Q06HA", "SC155Q07HA", "SC155Q08HA",
                     "SC155Q09HA", "SC155Q10HA", "SC155Q11HA")])
# Career guidance: sum of SC161 items (SC161Q04SA is listed twice, so it counts twice)
pisa2018_joint$Career_Guidance <- rowSums(
  pisa2018_joint[, c("SC161Q02SA", "SC161Q03SA", "SC161Q04SA", "SC161Q04SA")])

# Keep the analysis variables, renaming the two school-context items
pisa_SC155 <- pisa2018_joint %>%
  select(CNT, STRATUM, OECD,
         Urban = SC001Q01TA, Public_or_Private = SC013Q01TA,
         Career_Guidance, Accessibility, Human_Resource_Support)
pisa_SC155

Descriptive Statistics

The OECD PISA 2018 School Questionnaire dataset is the part of the PISA 2018 data that focuses on schools. It covers 80 countries and regions and documents the responses of 21,903 schools to 187 questions.

After cleaning, the dataset contains 8 variables. CNT identifies the country. STRATUM identifies the sampling stratum to which a school belongs. OECD indicates whether a school is located in an OECD country. Urban describes the size of the community in which a school is located. Public_or_Private indicates whether a school is public or private. Career_Guidance is the score a school reports for career guidance. Accessibility is the score a school reports for the accessibility of digital devices. Human_Resource_Support is the score a school reports for human resource support for digital devices.

The summary() output and the boxplots below present the descriptive statistics. A large number of missing values (NA) stands out: every numeric variable except OECD has between 1,185 and 2,092 NAs among the 21,903 schools. I still need to decide how to handle them properly.
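Two common starting points for handling missing values are to count them per variable and to decide whether composite scores should tolerate partially answered items. The sketch below uses a small hypothetical data frame (`toy`, with invented item names `Q1` to `Q3`), not the PISA data. It shows that `rowMeans()` yields NA for any row with a missing item unless `na.rm = TRUE` is set, and how listwise deletion with `complete.cases()` compares.

```r
# Hypothetical toy items with gaps; values are invented for illustration
toy <- data.frame(
  Q1 = c(1, 2, NA, 4),
  Q2 = c(2, NA, NA, 3),
  Q3 = c(3, 3, 1, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Default: a row with any missing item gets an NA composite score
strict_mean <- rowMeans(toy)

# With na.rm = TRUE the mean is taken over the items that were answered
lenient_mean <- rowMeans(toy, na.rm = TRUE)

# Listwise deletion: keep only rows without any missing value
complete_rows <- toy[complete.cases(toy), ]

na_per_column
strict_mean
lenient_mean
nrow(complete_rows)
```

Which option is appropriate depends on why the responses are missing, so this illustrates the trade-off only and does not recommend either approach for this dataset.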

Code

summary(pisa_SC155)

     CNT              STRATUM               OECD            Urban      
 Length:21903       Length:21903       Min.   :0.0000   Min.   :1.000  
 Class :character   Class :character   1st Qu.:0.0000   1st Qu.:2.000  
 Mode  :character   Mode  :character   Median :1.0000   Median :3.000  
                                       Mean   :0.5171   Mean   :3.007  
                                       3rd Qu.:1.0000   3rd Qu.:4.000  
                                       Max.   :1.0000   Max.   :5.000  
                                                        NA's   :1363   
 Public_or_Private Career_Guidance Accessibility   Human_Resource_Support
 Min.   :1.00      Min.   :0.000   Min.   :1.000   Min.   :1.000         
 1st Qu.:1.00      1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2.286         
 Median :1.00      Median :1.000   Median :2.750   Median :2.714         
 Mean   :1.19      Mean   :1.518   Mean   :2.674   Mean   :2.658         
 3rd Qu.:1.00      3rd Qu.:2.000   3rd Qu.:3.250   3rd Qu.:3.000         
 Max.   :2.00      Max.   :4.000   Max.   :4.000   Max.   :4.000         
 NA's   :2092      NA's   :1499    NA's   :1185    NA's   :1236

Code

# Reshape the three scores to long format: one row per original row and score
pisa_SC155_boxplot <- pisa_SC155 %>%
  select(STRATUM, Career_Guidance, Accessibility, Human_Resource_Support) %>%
  pivot_longer(cols = c(Career_Guidance, Accessibility, Human_Resource_Support),
               names_to = "Group", values_to = "Evaluation")

# One boxplot per score; stat_boxplot() adds error-bar caps to the whiskers
ggplot(pisa_SC155_boxplot, aes(Evaluation, fill = Group)) +
  stat_boxplot(geom = "errorbar", width = 0.2) +
  geom_boxplot() +
  facet_wrap(~Group) +
  labs(title = "Pisa2018 Evaluation") +
  coord_flip()

Warning: Removed 3920 rows containing non-finite values (`stat_boxplot()`).
Removed 3920 rows containing non-finite values (`stat_boxplot()`).
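The 3,920 removed rows match the NA counts reported by summary() above: after pivot_longer(), each missing score becomes its own row, so the warning total should equal the sum of the NAs in the three score columns. The warning appears twice because both stat_boxplot() and geom_boxplot() compute the boxplot statistic. A quick arithmetic check, with the counts copied from the summary output:

```r
# NA counts copied from the summary() output for the three pivoted columns
na_counts <- c(Career_Guidance = 1499,
               Accessibility = 1185,
               Human_Resource_Support = 1236)

# Total number of long-format rows with a missing Evaluation value
total_removed <- sum(na_counts)
total_removed
# 3920
```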