DACSS 601 - HW4: Descriptive Statistics and Visualization
For this project I changed my data set from the DOD Active Military Marital Data to a new source on school closures during the COVID-19 pandemic. The data come from the UNESCO Institute for Statistics (UIS) data on the COVID-19 Education Response (sourced below).
Context
On March 11, 2020, the World Health Organization declared the novel coronavirus disease, COVID-19, a pandemic. Within two days, then US President Donald Trump declared a national emergency, and by the end of the month populations across the United States and the globe began entering lockdown and practicing social distancing. During the first few months of the pandemic, many businesses, institutions, and organizations across the globe reduced occupancy or closed their doors in an attempt to control the spread of the virus. Schools across the world had to decide whether to continue in-person learning or adopt distance learning practices. This project reviews the status of school closures by country during 2020-2021 and which characteristics distinguished the countries that adopted closures from those that did not.
Content and Research Questions
The data set contains daily school closure status for 210 countries/territories from 2/16/2020 to 3/31/2022, a span of 775 days. This results in 162,750 observations (210 countries/territories x 775 days). The data also provide static information such as the approximate number of enrolled students and teachers for pre-primary to secondary school, as well as regional location, country economic level, and access to distance learning technologies.
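The observation count above can be checked with a few lines of arithmetic. This is a minimal sketch; the variable names are illustrative and not part of the original analysis.

```r
# Reporting window stated in the text (both end dates included)
start_date <- as.Date("2020-02-16")
end_date   <- as.Date("2022-03-31")

# Inclusive number of days in the window
n_days <- as.integer(end_date - start_date) + 1L

# One row per country/territory per day
n_countries <- 210L
n_obs <- n_days * n_countries

n_days
n_obs
```

If the two results do not match the row count of the loaded data, some country-day combinations are probably missing or duplicated.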
My research questions include:
* How did the practice of school closures and re-openings unfold over the pandemic years of 2020 - 2021?
* What characteristics, if any, by geographic location, country income level, student population size, or access to distance learning modalities could be predictors of adopting similar measures for similar events in the future?
Data Set Variables
Categorical variables:
- Country ID = Country ISO Alpha-3 code
- Country = Country name (English)
- Income Level = World Bank country income groups (i.e. high income, upper middle income, lower middle income, and low income)
- Regional Name = Sustainable Development Goals regional groups (i.e. Africa (Sub-Saharan); Asia (Central and Southern); Asia (Eastern and South-eastern); Latin America and Caribbean; Northern America and Europe; Oceania; and Western Asia and Northern Africa)
- School Status = status of school at time of data collection (i.e. Academic break; Closed due to COVID-19; Fully open; Seasonal school closures; and Partially open)
- Distance learning modalities (TV) = Existence of distance learning modalities (TV) in the country
- Distance learning modalities (Radio) = Existence of distance learning modalities (Radio) in the country
- Distance learning modalities (Online) = Existence of distance learning modalities (Online) in the country
- Distance learning modalities (Global) = Existence of distance learning modalities (combination of TV+Radio+Online) in the country
Numeric variables:
- Date = Reference date
- Enrolment (Pre-Primary to Upper Secondary) = number of enrolled students in Pre-Primary to Upper Secondary school levels
- Teachers (Pre-Primary to Upper Secondary) = number of teachers for Pre-Primary to Upper Secondary school levels
- School Age Population (Pre-Primary to Upper Secondary) = number of school age population at Pre-Primary to Upper Secondary school levels
- Weeks partially open = total number of weeks partially open
- Weeks fully closed = total number of weeks fully closed
Numeric Variables
To begin, I created a table summarizing every variable in the data set. The character variables only report their length and class, so the numeric variables are the focus here.
summary(unesco_fin)
Date Country ID Country
Min. :2020-02-16 Length:162750 Length:162750
1st Qu.:2020-08-27 Class :character Class :character
Median :2021-03-09 Mode :character Mode :character
Mean :2021-03-09
3rd Qu.:2021-09-19
Max. :2022-03-31
Regional Name Income Level Status
Length:162750 Length:162750 Length:162750
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Enrolment (Pre-Primary to Upper Secondary)
Min. : 0
1st Qu.: 208294
Median : 1406005
Mean : 7340691
3rd Qu.: 5360166
Max. :294893120
Teachers (Pre-Primary to Upper Secondary)
Min. : 0
1st Qu.: 10713
Median : 80550
Mean : 370162
3rd Qu.: 241546
Max. :15625021
School Age Population (Pre-Primary to Upper Secondary)
Min. : 0
1st Qu.: 204277
Median : 1594522
Mean : 8869652
3rd Qu.: 7045479
Max. :368816440
Distance learning modalities (TV)
Length:162750
Class :character
Mode :character
Distance learning modalities (Radio)
Length:162750
Class :character
Mode :character
Distance learning modalities (Global)
Length:162750
Class :character
Mode :character
Distance learning modalities (Online) Weeks fully closed
Length:162750 Min. : 0.00
Class :character 1st Qu.:10.00
Mode :character Median :16.00
Mean :19.74
3rd Qu.:27.00
Max. :75.00
Weeks partially open
Min. : 0.00
1st Qu.: 6.00
Median :18.50
Mean :20.85
3rd Qu.:30.00
Max. :77.00
Starting with the Date, we see the data was collected between 2/16/2020 and 3/31/2022, with the median date at approximately 3/09/2021. Next we observe a wide range of values for the total enrolled students and teacher population size. For total number of enrolled students, observations range from 0 in Svalbard, Faroe Islands, and Greenland, to the mean of 7.3 million, comparable to enrollment in Cameroon and Uzbekistan, to the maximum of 294.9 million students in India.
For teachers, observations range from 0 in the territories mentioned earlier, to an average of approximately 370,000, similar to that of Nepal, to the maximum of 15.6 million teachers in China. From here we can also add an estimate of the student to teacher ratio.
#add ratio of enrolled students to teachers
unesco_fin$Enrol_Teacher_Ratio <- unesco_fin$`Enrolment (Pre-Primary to Upper Secondary)`/ unesco_fin$`Teachers (Pre-Primary to Upper Secondary)`
#summary of enrollment to teacher ratio
summary(unesco_fin$Enrol_Teacher_Ratio)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
6.437 12.895 16.951 20.043 25.268 62.971 2325
The above summary of the enrolled students to teacher ratio shows an average of 20 students per teacher across the globe. Note that the 2325 NAs belong to the three countries/territories of Svalbard, Faroe Islands, and Greenland (3 x 775 daily rows), for which no enrollment or teacher value is present. We will come back to this ratio to determine whether it reflects the ability to socially distance and thus to remain open or return to opening.
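A small sketch of why those rows appear as NAs, assuming the missing values are stored as 0 for both enrollment and teachers (as the minimums in the summary suggest). The toy data frame and its column names are hypothetical.

```r
# Hypothetical toy rows; the second one mimics a territory with no data
toy <- data.frame(
  country   = c("A", "B", "C"),
  enrolment = c(1000, 0, 250),
  teachers  = c(50, 0, 10)
)

# 0 / 0 evaluates to NaN in R
toy$ratio <- toy$enrolment / toy$teachers

# is.na() treats NaN as missing, so summary() reports it under NA's
is.na(toy$ratio)
summary(toy$ratio)

# Three territories, each with 775 daily rows
n_missing_rows <- 3 * 775
n_missing_rows
```

Recoding the zeros to NA before dividing would give the same count of missing ratios while making the missingness explicit in the source columns.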
Finally, reviewing Weeks fully closed and Weeks partially open, we see similar averages (19.74 and 20.85 weeks). To compare their spread as well, I looked at the mean, median, IQR, standard deviation, and variance for the number of weeks schools were fully closed and partially open.
#Descriptive stats
#descriptive stats of weeks fully closed
unesco_fin %>%
  summarise(
    mean.closed = mean(`Weeks fully closed`, na.rm = TRUE),
    median.closed = median(`Weeks fully closed`, na.rm = TRUE),
    IQR.closed = IQR(`Weeks fully closed`, na.rm = TRUE),
    sd.closed = sd(`Weeks fully closed`, na.rm = TRUE),
    var.closed = var(`Weeks fully closed`, na.rm = TRUE))
# A tibble: 1 x 5
mean.closed median.closed IQR.closed sd.closed var.closed
<dbl> <dbl> <dbl> <dbl> <dbl>
1 19.7 16 17 14.6 214.
#descriptive stats of weeks partially open
unesco_fin %>%
  summarise(
    mean.partopen = mean(`Weeks partially open`, na.rm = TRUE),
    median.partopen = median(`Weeks partially open`, na.rm = TRUE),
    IQR.partopen = IQR(`Weeks partially open`, na.rm = TRUE),
    sd.partopen = sd(`Weeks partially open`, na.rm = TRUE),
    var.partopen = var(`Weeks partially open`, na.rm = TRUE))
# A tibble: 1 x 5
mean.partopen median.partopen IQR.partopen sd.partopen var.partopen
<dbl> <dbl> <dbl> <dbl> <dbl>
1 20.8 18.5 24 17.3 300.
The first table shows the average number of weeks closed (19.7), the median (16), and the interquartile range (17), as well as the standard deviation (14.6) and variance (214).
The second table shows the average number of weeks partially open (20.8), the median (18.5), and the interquartile range (24), as well as the standard deviation (17.3) and variance (300).
At a glance, we see a higher standard deviation and variance among the observations in time spent partially open than in time fully closed. This suggests higher variability in partially open responses. Note that the period under observation was 110 weeks and 5 days (775 days). Thus an average of about 20 weeks closed or partially open equates to about 18% of the roughly two years observed.
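The conversion from days to weeks and the share of the observation window can be reproduced as follows. This is only a sketch of the arithmetic, using the means reported in the summary output; the variable names are illustrative.

```r
total_days <- 775

# Whole weeks and leftover days in the observation window
whole_weeks <- total_days %/% 7
extra_days  <- total_days %% 7

# Means taken from the summary output above
mean_weeks <- c(fully_closed = 19.74, partially_open = 20.85)

# Share of the observation window, in percent
pct_of_window <- round(100 * mean_weeks / (total_days / 7), 1)

whole_weeks
extra_days
pct_of_window
```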
Categorical Variables
Next we will look at regional and income level variable frequency.
The table below shows the distribution of countries by the Sustainable Development Goals regional groups.
#table of countries by regions
table(unesco_short$`Regional Name`)
Africa (Sub-Saharan) Asia (Central and Southern)
48 14
Asia (Eastern and South-eastern) Latin America and the Caribbean
16 41
Northern America and Europe Oceania
50 17
Western Asia and Northern Africa
24
The next table is income level distribution by country based on the World Bank country income groups. Note there are 6 countries in which no data was captured: Anguilla, Cook Islands, Montserrat, Niue, Svalbard, and Tokelau.
#table of countries by income level
table(unesco_short$`Income Level`)
        High income  Low income  Lower middle income  Upper middle income
     6           71          29                   50                   54
Combining the previous two tables, below is a cross-tabulation of country count by income level and regional name.
#cross-tabulation of region by country income level
xtabs(~`Regional Name` + `Income Level`, unesco_short)
                                  Income Level
Regional Name                          High income  Low income
  Africa (Sub-Saharan)              0            2          22
  Asia (Central and Southern)       0            0           2
  Asia (Eastern and South-eastern)  0            4           1
  Latin America and the Caribbean   2           14           1
  Northern America and Europe       1           39           0
  Oceania                           3            4           0
  Western Asia and Northern Africa  0            8           3
                                  Income Level
Regional Name                       Lower middle income
  Africa (Sub-Saharan)                               19
  Asia (Central and Southern)                         8
  Asia (Eastern and South-eastern)                    7
  Latin America and the Caribbean                     4
  Northern America and Europe                         2
  Oceania                                             5
  Western Asia and Northern Africa                    5
                                  Income Level
Regional Name                       Upper middle income
  Africa (Sub-Saharan)                                5
  Asia (Central and Southern)                         4
  Asia (Eastern and South-eastern)                    4
  Latin America and the Caribbean                    20
  Northern America and Europe                         8
  Oceania                                             5
  Western Asia and Northern Africa                    8
I will come back to this table when we review how, or if, income level and region are potential indicators of adopting school closure practices in future public health crises.
#average number of weeks fully closed and partially open by region
unesco_short %>%
  group_by(`Regional Name`) %>%
  summarise(across(starts_with("Weeks"), ~ mean(.x, na.rm = TRUE)))
# A tibble: 7 x 3
`Regional Name` `Weeks fully clo~` `Weeks partial~`
<chr> <dbl> <dbl>
1 Africa (Sub-Saharan) 18.1 13.3
2 Asia (Central and Southern) 24.4 27.8
3 Asia (Eastern and South-eastern) 24.4 30.6
4 Latin America and the Caribbean 29.6 32.3
5 Northern America and Europe 12.4 18
6 Oceania 7.12 6.24
7 Western Asia and Northern Africa 24.6 22.0
Interestingly, the regions with the longest average time both fully closed and partially open are Latin America and the Caribbean (29.6 and 32.3 weeks), Asia (Eastern and South-eastern), Asia (Central and Southern), and Western Asia and Northern Africa, each averaging more than 22 weeks on both measures. Oceania experienced the least school disruption (about 7 weeks fully closed and 6 weeks partially open). This may be due to geographical isolation limiting the spread of the virus.
This series of graphics provides a detailed visual representation of the distribution of weeks closed by country and region. The aim is to quickly identify outliers and trends within each region.
Generally, we can see that regions responded with similar thresholds of closures, with the exception of an outlier or two in each group. This is evident in Fiji in Oceania, Uganda in Sub-Saharan Africa, and Bangladesh in Central/Southern Asia, among others. There are also several observations where the weeks fully closed total zero or have no data (i.e. the United States, Sweden, Tajikistan, Nicaragua, and Burundi).
Limitations/Next Steps: I'd like to see all the countries on one plot in descending order of weeks fully closed, with the bar colors distinguished by region. I think this will show a full-scale comparison of school closures. I also need to review the specifics of the zero/no data locations.
Access to Distance Learning
This section of plots summarizes the students’ access to distance learning modalities by country income level and number of weeks fully closed.
The first plot shows the country income level distribution by type of distance learning modality. The aim is to uncover whether there are trends between income and the type and number of modalities available.
#bar chart of countries by distance learning mods
unesco_short %>%
ggplot(aes(`Distance learning modalities (Global)`, fill= `Income Level`))+
geom_bar(position = "stack")+
scale_fill_brewer(palette = "BuPu")+
labs(y= "No. of Countries", title = "Count of Countries by Distance Learning Modalities")+
guides(x = guide_axis(n.dodge = 2))
#cross-tabulation of learning mods by country income level
xtabs(~`Distance learning modalities (Global)` + `Income Level`, unesco_short)
                                     Income Level
Distance learning modalities (Global)     High income  Low income
  None                                 6           14           8
  Online                               0           17           0
  Online + Radio                       0            0           1
  Online + TV                          0           33           3
  Radio                                0            0           5
  TV                                   0            1           1
  TV + Online + Radio                  0            6           9
  TV + Radio                           0            0           2
                                     Income Level
Distance learning modalities (Global)  Lower middle income
  None                                                   2
  Online                                                 2
  Online + Radio                                         3
  Online + TV                                           16
  Radio                                                  0
  TV                                                     2
  TV + Online + Radio                                   20
  TV + Radio                                             5
                                     Income Level
Distance learning modalities (Global)  Upper middle income
  None                                                   4
  Online                                                 4
  Online + Radio                                         2
  Online + TV                                           23
  Radio                                                  0
  TV                                                     1
  TV + Online + Radio                                   20
  TV + Radio                                             0
Here we see lower-middle and upper-middle income countries gravitating towards the Online + TV and TV + Online + Radio combinations. These two modes appear to be the most popular among countries overall. Interestingly, high income countries account for the largest number of countries using no modality (14) or online only (17), which is 19.7% and 23.9% of the 71 high income countries, respectively. Radio appears to be a tool used more in low and lower-middle income countries.
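Because the income groups differ in size, shares within each income group can be easier to compare than raw counts. The sketch below re-enters the counts from the cross-tabulation above (leaving out the six countries with no income level) and computes column percentages with prop.table. The object names are illustrative.

```r
# Counts copied from the cross-tabulation above
# rows = distance learning modality, columns = income level
modality <- c("None", "Online", "Online + Radio", "Online + TV",
              "Radio", "TV", "TV + Online + Radio", "TV + Radio")
income <- c("High income", "Low income",
            "Lower middle income", "Upper middle income")

counts <- matrix(
  c(14, 8,  2,  4,
    17, 0,  2,  4,
     0, 1,  3,  2,
    33, 3, 16, 23,
     0, 5,  0,  0,
     1, 1,  2,  1,
     6, 9, 20, 20,
     0, 2,  5,  0),
  ncol = 4, byrow = TRUE,
  dimnames = list(modality, income)
)

# Percentage of each income group using each modality (columns sum to 100)
col_share <- round(100 * prop.table(counts, margin = 2), 1)
col_share
```

On the real data the same result should be obtainable by passing the xtabs output directly to prop.table with margin = 2, after dropping the unlabeled income column.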
Limitations/Next Steps: I expected to see rising incomes correspond to more access; however, I will need to investigate the countries choosing none. Appropriate levels of distance learning access could impact a community's ability to use this option when in-person learning is halted.
The goal here is to see if there is a relationship between the types of access and the number of weeks a school system is closed.
#Weeks fully closed by distance learning modality
unesco_short %>%
ggplot(aes(`Distance learning modalities (Global)`, `Weeks fully closed`))+
geom_boxplot()+
labs(title = "No. of Weeks Fully Closed by Distance Learning Modalities")+
guides(x = guide_axis(n.dodge = 2))
Overall we see the majority of the distribution falls between 10 and 40 weeks regardless of the technology available. Locations with no modality had the fewest weeks fully closed, followed by those with Radio only. In contrast, those with TV + Online + Radio saw longer school closures (especially among the outliers).
Limitations/ Next Steps: I think adding the mean could be helpful to the reader.
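One possible way to mark the mean on each box is ggplot2's stat_summary layer. The sketch below uses a small hypothetical data frame with made-up values standing in for unesco_short; the column names, marker shape, and color are illustrative choices rather than the original plot settings.

```r
library(ggplot2)

# Hypothetical toy data standing in for unesco_short (values are made up)
toy <- data.frame(
  modality     = rep(c("None", "Online", "TV + Online + Radio"), each = 4),
  weeks_closed = c(0, 5, 8, 12, 10, 14, 20, 30, 12, 22, 35, 60)
)

# Boxplot with the group mean drawn as a red diamond on each box
p <- ggplot(toy, aes(modality, weeks_closed)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3,
               colour = "red") +
  labs(x = "Distance learning modalities", y = "Weeks fully closed",
       title = "No. of Weeks Fully Closed by Distance Learning Modalities")
p
```

Showing the mean next to the median line makes skew visible: a mean well above the median points to a few long closures pulling the average up.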
Finally, for Blog #5, I will add in the time element to review changes over the course of the two years.
Source: UNESCO map on school closures [https://en.unesco.org/covid19/educationresponse] and UIS, March 2022 [http://data.uis.unesco.org]
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Muhammad (2022, April 27). Data Analytics and Computational Social Science: KMuhammad_HW4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomkmuhamma895068/
BibTeX citation
@misc{muhammad2022kmuhammad_hw4, author = {Muhammad, Kalimah}, title = {Data Analytics and Computational Social Science: KMuhammad_HW4}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomkmuhamma895068/}, year = {2022} }