This dataset is one part of PISA 2018 dataset with a focus on schools. It covers 80 countries and different regions within each country. The dataset documents 21,903 schools’ responses regarding 187 questions. Some key identifiers include CNT (Country Name), STRATUM (Region Name) and OECD (belongs to or not OECD) ## Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
The dataset covers 80 countries and regions including 21903 schools’ responses regrading 187 questions. So I decided to narrow my reach here to five countries. I’d like to look at schools responses regarding a set of questions “SC155”. SC155 surveys the accessibility to digital devices and its related training as well as assistance. I clean the data and group_by data as regional code. So I make two different datasets at first. One includes all identifier information. The other covers the columns that contains “SC155Q”. Then, I use function “cbind” to combine these two into a new dataset (pisa2018_joint). Furthermore, I found that the questions from SC155Q01HA to SC155Q01HA focus on the accessibility to digital devices while the questions from SC155Q05HA to SC155Q11HA stress on if the schools offer enough training, support, and incentives. I mutate two new variables–“Accessibility” and “Human_Resource_Support that respectively calculate the average of the former and the latter. Besides, I calculate every country average for”Accessibility” and “Human_Resource_Support” with the function distinct() and rowMeans() After tidying data, the dataset has four columns and 80 rows.
Time Dependent Visualization (Map)
# joint data--merge pisa2018 into the world map datadata("World")world2<-World %>%mutate(CNT=iso_a3) %>%select(-iso_a3)world_pisa <-merge(x=world2, y=pisa2018_joint_clean, by="CNT", all.x=T)world_pisa
Simple feature collection with 177 features and 18 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
Geodetic CRS: WGS 84
First 10 features:
CNT name sovereignt continent
1 AFG Afghanistan Afghanistan Asia
2 AGO Angola Angola Africa
3 ALB Albania Albania Europe
4 ARE United Arab Emirates United Arab Emirates Asia
5 ARG Argentina Argentina South America
6 ARM Armenia Armenia Asia
7 ATA Antarctica Antarctica Antarctica
8 ATF Fr. S. Antarctic Lands France Seven seas (open ocean)
9 AUS Australia Australia Oceania
10 AUT Austria Austria Europe
area pop_est pop_est_dens economy
1 652860.000 [km^2] 28400000 4.350090e+01 7. Least developed region
2 1246700.000 [km^2] 12799293 1.026654e+01 7. Least developed region
3 27400.000 [km^2] 3639453 1.328268e+02 6. Developing region
4 71252.172 [km^2] 4798491 6.734519e+01 6. Developing region
5 2736690.000 [km^2] 40913584 1.495003e+01 5. Emerging region: G20
6 28470.000 [km^2] 2967004 1.042151e+02 6. Developing region
7 12259213.973 [km^2] 3802 3.101341e-04 6. Developing region
8 7257.455 [km^2] 140 1.929051e-02 6. Developing region
9 7682300.000 [km^2] 21262641 2.767744e+00 2. Developed region: nonG7
10 82523.000 [km^2] 8210281 9.949082e+01 2. Developed region: nonG7
income_grp gdp_cap_est life_exp well_being footprint inequality
1 5. Low income 784.1549 59.668 3.8 0.79 0.42655744
2 3. Upper middle income 8617.6635 NA NA NA NA
3 4. Lower middle income 5992.6588 77.347 5.5 2.21 0.16513372
4 2. High income: nonOECD 38407.9078 NA NA NA NA
5 3. Upper middle income 14027.1261 75.927 6.5 3.14 0.16423830
6 4. Lower middle income 6326.2469 74.446 4.3 2.23 0.21664810
7 2. High income: nonOECD 200000.0000 NA NA NA NA
8 2. High income: nonOECD 114285.7143 NA NA NA NA
9 1. High income: OECD 37634.0832 82.052 7.2 9.31 0.08067825
10 1. High income: OECD 40132.6093 81.004 7.4 6.06 0.07129351
HPI OECD Accessibility_Country_Ave Human_Resource_ave
1 20.22535 NA NA NA
2 NA NA NA NA
3 36.76687 0 2.260000 2.604157
4 NA 0 3.150000 3.111065
5 35.19024 0 1.919374 2.139867
6 25.66642 NA NA NA
7 NA NA NA NA
8 NA NA NA NA
9 21.22897 1 3.105093 2.796798
10 30.47822 1 3.093531 3.021978
geometry
1 MULTIPOLYGON (((61.21082 35...
2 MULTIPOLYGON (((16.32653 -5...
3 MULTIPOLYGON (((20.59025 41...
4 MULTIPOLYGON (((51.57952 24...
5 MULTIPOLYGON (((-65.5 -55.2...
6 MULTIPOLYGON (((43.58275 41...
7 MULTIPOLYGON (((-59.57209 -...
8 MULTIPOLYGON (((68.935 -48....
9 MULTIPOLYGON (((145.398 -40...
10 MULTIPOLYGON (((16.97967 48...
This data doesn’t document the data in a temporal way but in a geographic way. So a world map would be a best choice to reflect the data across the world.In order to create a world map, I need joint my data to world map data. So I use merge function and created a dataset “world pisa) which include the map data and variables that I’d like to show on the map.
#draw world map show regional difference# draw a world map about the accessibly of digital device#because my dataset reflect regional differences instead of differences over time, I believe that may would be a good choice of showing regional difference# data("world_pisa")tm_shape(world_pisa) +tm_polygons(col="Accessibility_Country_Ave", palette ="Set1")+tmap_options(check.and.fix=T)
Finally, I create two world maps reflecting regional differences in terms of the accessibility to digital devices and the human resources support for them all over the world. Apparently, just a few countries have the grades higher than 3 points, which reflects that these countries’ agreement of enjoying good accessibility to digital devices.However, even some developed countries still expressed their limited accessibility to digital devices such as United Kingdom and France. The phenomenon, in fact, open the more questions and their answers will require further investigations on these country. I will take France and Unite Kingdom as case studies. With respect to “Human Resource Support,” most countries have reported the grades under 3 points. This may not disclose that these countries share the similar scenarios but there is no any objective measurements for them to measure their access to and human resources support on digital devices. Therefore, instead of reflecting that OECD countries lack human resources support, the data may show OECD and non-OECD countries have distinct expectations on human resources support. So self-report can only demonstrate the gap between their expectations and current situations.
# United Kingdom (GBR), Hong Kong (HKG), Philippines (PHL), Argentina (ARG), Brazil(BRA)df<-pisa2018_joint %>%group_by(CNT) %>%filter(CNT=="GBR"| CNT=="HKG"| CNT=="PHL"| CNT=="ARG"| CNT=="BRA"|CNT=="FRA" ) %>%drop_na() %>%count(Q1=factor(SC155Q01HA))%>%mutate(pct=paste0(sprintf("%4.1f",n/sum(n)*100), "%"))df
# A tibble: 24 × 4
# Groups: CNT [6]
CNT Q1 n pct
<chr> <fct> <int> <chr>
1 ARG 1 156 "36.7%"
2 ARG 2 145 "34.1%"
3 ARG 3 94 "22.1%"
4 ARG 4 30 " 7.1%"
5 BRA 1 215 "39.5%"
6 BRA 2 198 "36.4%"
7 BRA 3 100 "18.4%"
8 BRA 4 31 " 5.7%"
9 FRA 1 15 " 7.1%"
10 FRA 2 40 "19.0%"
# … with 14 more rows
ggplot(df, aes(x=CNT, y=n ,fill=Q1), na.rm=T)+geom_bar(stat="identity", position="dodge")+geom_text(aes(label=pct), position =position_dodge(width =1),vjust=-0.5, size=2)
The bar chart reflects schools’ responses to SC155Q01HA–“The number of digital devices connected to the Internet is sufficient” It has demonstrated that most schools in Argentina and Brazil are not happy with their accessibility to digital devices–more 70% of them report “strongly disagree” and “disagree.” Schools in the Philippines enjoyed the better accessibility to digital devices–less than 60% of them return the native responses. By sharp contrast, 60-70% of schools in France, the United Kingdom and Hong Kong are happy with their accessibility. Additionally, it discloses sample sizes in six countries. Surprisingly, French sample size is relatively small. The PISA doesn’t explain the reason.
Source Code
---title: "Challenge 6 Guanhua Tan"author: "Guanhua Tan"description: "Visualizing Time and Relationships"date: "11/1/2022"format: html: toc: true code-copy: true code-tools: truecategories: - challenge_6 - pisa2018---```{r}#| label: setup#| warning: false#| message: falselibrary(tidyverse)library(ggplot2)install.packages("tmap")library(tmap)library(dplyr)install.packages("s2")library(s2)knitr::opts_chunk$set(echo =TRUE, warning=FALSE, message=FALSE)```## Challenge OverviewToday's challenge is to:1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)2) tidy data (as needed, including sanity checks)3) mutate variables as needed (including sanity checks)4) create at least one graph including time (evolution)- try to make them "publication" ready (optional)- Explain why you choose the specific graph type5) Create at least one graph depicting part-whole or flow relationships- try to make them "publication" ready (optional)- Explain why you choose the specific graph type[R Graph Gallery](https://r-graph-gallery.com/) is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.(be sure to only include the category tags for the data you use!)## Read in dataRead in one (or more) of the following datasets, using the correct R package and command.- debt ⭐- fed_rate ⭐⭐- abc_poll ⭐⭐⭐- usa_hh ⭐⭐⭐- hotel_bookings ⭐⭐⭐⭐- AB_NYC ⭐⭐⭐⭐⭐```{r}# read the datapisa <-read_csv('_data/CY07_MSU_SCH_QQQ.csv')dim(pisa)length(unique(pisa$CNT))pisa2018 <- pisa %>%select(starts_with("SC155"))dim(pisa2018)```### Briefly describe the dataThis dataset is one part of PISA 2018 dataset with a focus on schools. It covers 80 countries and different regions within each country. The dataset documents 21,903 schools' responses regarding 187 questions. Some key identifiers include CNT (Country Name), STRATUM (Region Name) and OECD (belongs to or not OECD)## Tidy Data (as needed)Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.```{r}#tidy datapisa2018_joint <-cbind(pisa[,1:12], pisa2018) %>%select(CNT, STRATUM, OECD, SC155Q01HA,SC155Q02HA, SC155Q03HA, SC155Q04HA, SC155Q05HA, SC155Q06HA, SC155Q07HA, SC155Q08HA, SC155Q09HA, SC155Q10HA, SC155Q11HA)pisa2018_joint$Accessibility=rowMeans(pisa2018_joint[,c("SC155Q01HA","SC155Q02HA", "SC155Q03HA","SC155Q04HA")])pisa2018_joint$Human_Resource_Support=rowMeans(pisa2018_joint[,c("SC155Q05HA","SC155Q06HA", "SC155Q07HA","SC155Q08HA","SC155Q09HA", "SC155Q10HA", "SC155Q11HA")])pisa2018_joint_clean <-pisa2018_joint %>%select(CNT, STRATUM, OECD, Accessibility, Human_Resource_Support) %>%group_by(CNT) %>%mutate(Accessibility_Country_Ave=mean(Accessibility, na.rm=T)) %>%mutate(Human_Resource_ave=mean(Human_Resource_Support, na.rm=T)) %>%select(CNT,OECD, Accessibility_Country_Ave, Human_Resource_ave) %>%distinct() %>%arrange(desc(Accessibility_Country_Ave))pisa2018_joint_clean```Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?Document your work here.The dataset covers 80 countries and regions including 21903 schools' responses regrading 187 questions. So I decided to narrow my reach here to five countries. I'd like to look at schools responses regarding a set of questions "SC155". SC155 surveys the accessibility to digital devices and its related training as well as assistance. I clean the data and group_by data as regional code. So I make two different datasets at first. One includes all identifier information. The other covers the columns that contains "SC155Q". Then, I use function "cbind" to combine these two into a new dataset (pisa2018_joint). Furthermore, I found that the questions from SC155Q01HA to SC155Q01HA focus on the accessibility to digital devices while the questions from SC155Q05HA to SC155Q11HA stress on if the schools offer enough training, support, and incentives. I mutate two new variables--"Accessibility" and "Human_Resource_Support that respectively calculate the average of the former and the latter. Besides, I calculate every country average for "Accessibility" and "Human_Resource_Support" with the function distinct() and rowMeans() After tidying data, the dataset has four columns and 80 rows.## Time Dependent Visualization (Map)```{r}# joint data--merge pisa2018 into the world map datadata("World")world2<-World %>%mutate(CNT=iso_a3) %>%select(-iso_a3)world_pisa <-merge(x=world2, y=pisa2018_joint_clean, by="CNT", all.x=T)world_pisa```This data doesn't document the data in a temporal way but in a geographic way. So a world map would be a best choice to reflect the data across the world.In order to create a world map, I need joint my data to world map data. So I use merge function and created a dataset "world pisa) which include the map data and variables that I'd like to show on the map.```{r}#draw world map show regional difference# draw a world map about the accessibly of digital device#because my dataset reflect regional differences instead of differences over time, I believe that may would be a good choice of showing regional difference# data("world_pisa")tm_shape(world_pisa) +tm_polygons(col="Accessibility_Country_Ave", palette ="Set1")+tmap_options(check.and.fix=T)tm_shape(world_pisa) +tm_polygons(col="Human_Resource_ave", palette ="Set1")+tmap_options(check.and.fix=T)```Finally, I create two world maps reflecting regional differences in terms of the accessibility to digital devices and the human resources support for them all over the world. Apparently, just a few countries have the grades higher than 3 points, which reflects that these countries' agreement of enjoying good accessibility to digital devices.However, even some developed countries still expressed their limited accessibility to digital devices such as United Kingdom and France. The phenomenon, in fact, open the more questions and their answers will require further investigations on these country. I will take France and Unite Kingdom as case studies. With respect to "Human Resource Support," most countries have reported the grades under 3 points. This may not disclose that these countries share the similar scenarios but there is no any objective measurements for them to measure their access to and human resources support on digital devices. Therefore, instead of reflecting that OECD countries lack human resources support, the data may show OECD and non-OECD countries have distinct expectations on human resources support. So self-report can only demonstrate the gap between their expectations and current situations.## Visualizing Part-Whole Relationships```{r}pisa2018_joint <-cbind(pisa[,1:12], pisa2018) %>%select(CNT, STRATUM, OECD, SC155Q01HA,SC155Q02HA, SC155Q03HA, SC155Q04HA, SC155Q05HA, SC155Q06HA, SC155Q07HA, SC155Q08HA, SC155Q09HA, SC155Q10HA, SC155Q11HA) pisa2018_joint``````{r}# United Kingdom (GBR), Hong Kong (HKG), Philippines (PHL), Argentina (ARG), Brazil(BRA)df<-pisa2018_joint %>%group_by(CNT) %>%filter(CNT=="GBR"| CNT=="HKG"| CNT=="PHL"| CNT=="ARG"| CNT=="BRA"|CNT=="FRA" ) %>%drop_na() %>%count(Q1=factor(SC155Q01HA))%>%mutate(pct=paste0(sprintf("%4.1f",n/sum(n)*100), "%"))dfggplot(df, aes(x=CNT, y=n ,fill=Q1), na.rm=T)+geom_bar(stat="identity", position="dodge")+geom_text(aes(label=pct), position =position_dodge(width =1),vjust=-0.5, size=2)```The bar chart reflects schools' responses to SC155Q01HA--"The number of digital devices connected to the Internet is sufficient" It has demonstrated that most schools in Argentina and Brazil are not happy with their accessibility to digital devices--more 70% of them report "strongly disagree" and "disagree." Schools in the Philippines enjoyed the better accessibility to digital devices--less than 60% of them return the native responses. By sharp contrast, 60-70% of schools in France, the United Kingdom and Hong Kong are happy with their accessibility. Additionally, it discloses sample sizes in six countries. Surprisingly, French sample size is relatively small. The PISA doesn't explain the reason.