Identifying data for Final Project, and Generating Visualuals
This analysis will combine country level data to see how cultural indicators drive COVID-19 vaccination rates around the world, and how different health systems, and strength of those impact that relationship. I will combine three different datasets at the country level:
1: Cultural indicators from Geert Hofstede’s Cultural Dimensions Theory Data https://data.world/adamhelsinger/geerthofstedeculturaldimension
2: National Vaccination Rates from Open-Source Github Repo: https://github.com/owid/covid-19-data
3: National Health System Classification Data - TBD
4: Global Country Indicators (Income, Income Per Capita, Government Type (indexes?), Population etc…) - TBD
culture_indices <- read.csv2("https://query.data.world/s/c4dbzx65xq3bnkf65gnutajtb2566a", header=TRUE, stringsAsFactors=FALSE) %>%
na_if("#NULL!") %>%
mutate(country = replace(country, country == "U.S.A.", "United States"))
head(df)
1 function (x, df1, df2, ncp, log = FALSE)
2 {
3 if (missing(ncp))
4 .Call(C_df, x, df1, df2, log)
5 else .Call(C_dnf, x, df1, df2, ncp, log)
6 }
"https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data" %>%
read_html() %>%
html_nodes(xpath = '//*[@role="rowheader"]') %>%
html_nodes('span a') %>%
html_attr('href') %>%
sub('blob/', '', .) %>%
paste0('https://raw.githubusercontent.com', .) %>%
purrr::map_df(read.csv) -> vaccine_data
head(vaccine_data)
location date
1 Afghanistan 2021-02-22
2 Afghanistan 2021-02-28
3 Afghanistan 2021-03-16
4 Afghanistan 2021-04-07
5 Afghanistan 2021-04-22
6 Afghanistan 2021-05-11
vaccine
1 Oxford/AstraZeneca
2 Oxford/AstraZeneca
3 Oxford/AstraZeneca
4 Oxford/AstraZeneca
5 Oxford/AstraZeneca
6 Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing
source_url
1 https://tolonews.com/index.php/health-170225
2 https://tolonews.com/index.php/health-170355
3 http://www.xinhuanet.com/english/asiapacific/2021-03/16/c_139814668.htm
4 http://www.xinhuanet.com/english/asiapacific/2021-04/07/c_139864755.htm
5 https://reliefweb.int/report/afghanistan/afghanistan-strategic-situation-report-covid-19-no-95-22-april-2021
6 https://covid19.who.int/
total_vaccinations people_vaccinated people_fully_vaccinated
1 0 0 NA
2 8200 8200 NA
3 54000 54000 NA
4 120000 120000 NA
5 240000 240000 NA
6 504502 448878 55624
total_boosters people_partly_vaccinated
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
Used “wbstat” package for ease of access to World Bank API - https://cran.r-project.org/web/packages/wbstats/vignettes/wbstats.html
GHED_PVT-DCHE_SHA2011 - Domestic private health expenditure (PVT-D) as percentage of current health expenditure (CHE) (%)
GHED_GGHE-DCHE_SHA2011 - Domestic general government health expenditure (GGHE-D) as percentage of current health expenditure (CHE) (%)
GHED_GGHE-DGGE_SHA2011 - Domestic general government health expenditure (GGHE-D) as percentage of general government expenditure (GGE) (%)
GHED_PVT-D_pc_PPP_SHA2011 - Domestic private health expenditure (PVT-D) per capita in PPP int$
GHED_GGHE-D_pc_PPP_SHA2011 - Domestic general government health expenditure (GGHE-D) per capita in PPP int$
GHED_CHEGDP_SHA2011 - Current health expenditure (CHE) as percentage of gross domestic product (GDP) (%)
GHED_EXT_pc_PPP_SHA2011 - External health expenditure (EXT) per capita in PPP int$
GDO_q9x5_1 - Density of outpatient health centres (per 100,000 population)
DEVICES02 - Total density per 100 000 population: Health centres
who_health_indicators<-fromJSON("https://ghoapi.azureedge.net/api/Indicator?$filter=contains(IndicatorName,%27health%27)")$value
Health_System_Data <- c('GHED_PVT-DCHE_SHA2011','GHED_GGHE-DCHE_SHA2011','GHED_GGHE-DGGE_SHA2011','GHED_PVT-D_pc_PPP_SHA2011','GHED_GGHE-D_pc_PPP_SHA2011','GHED_CHEGDP_SHA2011','GHED_EXT_pc_PPP_SHA2011','GDO_q9x5_1','DEVICES02') %>%
paste0('https://ghoapi.azureedge.net/api/', .) %>%
purrr::map_df({function(x) fromJSON(x)$value}) %>%
left_join(who_health_indicators %>%
select(IndicatorCode,IndicatorName), by = 'IndicatorCode') %>%
left_join(fromJSON('https://ghoapi.azureedge.net/api/DIMENSION/COUNTRY/DimensionValues')$value[c("Code","Title")], by = c('SpatialDim'='Code')) %>%
rename(country = Title, year = TimeDim, who_indicator = IndicatorName, who_indicator_code = IndicatorCode, who_indicator_value = Value ) %>%
arrange(desc(year)) %>%
distinct(country, who_indicator, who_indicator_code,.keep_all = TRUE) %>%
select(country, who_indicator, who_indicator_code,who_indicator_value)
df <- vaccine_data %>%
select(location, date, people_vaccinated, people_fully_vaccinated, people_partly_vaccinated) %>%
rename(country = location) %>%
# arrange(desc(date)) %>%
# distinct(country, .keep_all= TRUE) %>%
full_join(culture_indices %>%
select(country, pdi, idv, mas, uai, ltowvs, ivr),
by='country') %>%
full_join(pop_gdp %>%
select(country, gdp, pop, gdp_per_capita),
by = 'country') %>%
mutate(pop_vax_rate = people_vaccinated/pop, pop_part_vax_rate = people_partly_vaccinated/pop, pop_full_vax_rate = people_fully_vaccinated/pop, date = as.Date(date), year = as.integer(format(as.Date(date), format = "%Y")),gdp_quartile = ntile(gdp_per_capita, 4)) %>%
left_join(Health_System_Data, by = 'country') %>%
pivot_longer(c(pdi, idv, mas, uai, ltowvs, ivr),names_to = 'culture_index', values_to = "culture_index_value") %>%
mutate(culture_index_value = as.numeric(culture_index_value))
print(head(df))
# A tibble: 6 × 18
country date people_vaccinated people_fully_vaccinated
<chr> <date> <int> <int>
1 Afghanistan 2021-02-22 0 NA
2 Afghanistan 2021-02-22 0 NA
3 Afghanistan 2021-02-22 0 NA
4 Afghanistan 2021-02-22 0 NA
5 Afghanistan 2021-02-22 0 NA
6 Afghanistan 2021-02-22 0 NA
# … with 14 more variables: people_partly_vaccinated <int>,
# gdp <dbl>, pop <dbl>, gdp_per_capita <dbl>, pop_vax_rate <dbl>,
# pop_part_vax_rate <dbl>, pop_full_vax_rate <dbl>, year <int>,
# gdp_quartile <int>, who_indicator <chr>,
# who_indicator_code <chr>, who_indicator_value <chr>,
# culture_index <chr>, culture_index_value <dbl>
country date people_vaccinated
Length:278 Min. :2021-02-01 Min. : 47
Class :character 1st Qu.:2022-02-17 1st Qu.: 265860
Mode :character Median :2022-02-19 Median : 2270897
Mean :2022-02-09 Mean : 18811059
3rd Qu.:2022-02-21 3rd Qu.: 9643354
Max. :2022-02-21 Max. :962967313
NA's :54 NA's :89
people_fully_vaccinated people_partly_vaccinated gdp
Min. : 47 Min. : NA Min. :4.886e+07
1st Qu.: 234910 1st Qu.: NA 1st Qu.:7.780e+09
Median : 2038507 Median : NA Median :3.119e+10
Mean : 16083651 Mean :NaN Mean :4.310e+11
3rd Qu.: 8042658 3rd Qu.: NA 3rd Qu.:2.031e+11
Max. :774808452 Max. : NA Max. :2.095e+13
NA's :85 NA's :278 NA's :84
pop gdp_per_capita pop_vax_rate
Min. :1.083e+04 Min. : 239 Min. :0.00075
1st Qu.:7.828e+05 1st Qu.: 2170 1st Qu.:0.44518
Median :6.725e+06 Median : 5467 Median :0.67689
Mean :3.581e+07 Mean : 15312 Mean :0.59411
3rd Qu.:2.571e+07 3rd Qu.: 17700 3rd Qu.:0.77034
Max. :1.411e+09 Max. :173688 Max. :1.23787
NA's :62 NA's :84 NA's :128
pop_part_vax_rate pop_full_vax_rate year gdp_quartile
Min. : NA Min. :0.00071 Min. :2021 Min. :1.000
1st Qu.: NA 1st Qu.:0.38049 1st Qu.:2022 1st Qu.:1.000
Median : NA Median :0.58409 Median :2022 Median :2.000
Mean :NaN Mean :0.52879 Mean :2022 Mean :2.113
3rd Qu.: NA 3rd Qu.:0.71747 3rd Qu.:2022 3rd Qu.:3.000
Max. : NA Max. :1.21219 Max. :2022 Max. :4.000
NA's :278 NA's :124 NA's :54 NA's :84
who_indicator who_indicator_code who_indicator_value
Length:278 Length:278 Length:278
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
culture_index culture_index_value
Length:278 Min. : 11.00
Class :character 1st Qu.: 42.50
Mode :character Median : 62.00
Mean : 59.33
3rd Qu.: 72.50
Max. :104.00
NA's :200
# Keep only 3 names
plot_df <- df %>%
arrange(gdp_per_capita) %>%
filter(!is.na(culture_index_value) & !is.na(pop_vax_rate)) %>%
select(date,pop_vax_rate, country,gdp_quartile) %>%
distinct()
p <- vector('list', 4)
i <- 1
quartiles <- c('1st Quartile (Poorest)','2nd Quartile','3rd Quartile','4th Quartile (Wealthiest)')
colors <- c('YlOrRd','BuGn','BuPu','PuRd')
for (var in unique(plot_df$gdp_quartile)) {
dev.new()
p[[i]] <- plot_df %>%
filter(gdp_quartile == var) %>%
ggplot(
aes(x=date, y=pop_vax_rate, group=country, color=country))+
ggtitle(quartiles[i]) +
theme(legend.position = "none") +
geom_line() +
scale_colour_brewer(palette = colors[i] ) +
geom_hline(yintercept = .7) +
ylim(0,1) +
xlim(as.Date(c(min(plot_df$date), max(plot_df$date)), format="%d/%m/%Y") )
i <- i+1
}
grid.arrange(p[[1]], p[[2]],p[[3]],p[[4]], nrow = 2,top = textGrob("Vaccination Progress by Country Wealth",gp=gpar(fontsize=20,font=3)))
GRID NOT PRINTING???
This graph shows the relationship between vaccination rate progress and national wealth, measured by GDP per capita. A reference line is included here to denote a 70% vaccination rate (“heard immunity”, more or less(. We can clearly see that the wealthiest nations achieved heard immunity much earlier than all other groups, while many 1st, 2nd and 3rd quartile countries still have not achieved heard immunity. This graph is unable to capture the impact of other confounding factors that may be impacting a nations population vaccination rate progress. Furthermore, we are unable to say if the differences we we see between income groups in vaccination progress is statistically significant.
corr_df <- df %>%
arrange(desc(date)) %>%
distinct(country, .keep_all = TRUE) %>%
select(country, pop_vax_rate, gdp_per_capita) %>%
na.omit()
p <- corr_df %>%
ggplot() +
ggtitle("National Vaccination Rates by GDP Per Capita") +
geom_point(aes(x = gdp_per_capita, y = pop_vax_rate, color = country, text = paste("country:", country))) +
stat_smooth(aes(x = gdp_per_capita, y = pop_vax_rate), method = "lm", col = "red", se = FALSE) +
scale_x_log10()
fig <- ggplotly(p)
fig
Here we see a positive relationship between GDP per Capita and national vaccination rates. This implies (as would be expected) that wealthier nations have an easier time getting their population vaccinated, or are able to get their population vaccinated more quickly, given that many vaccination campaigns are still in progress. This visualization does not provide us with any statistical verification of the relationship that it presented visually.
corr_df <- df %>%
arrange(desc(date)) %>%
filter(culture_index == 'idv') %>%
distinct(country, .keep_all = TRUE) %>%
select(country, pop_vax_rate, culture_index_value) %>%
na.omit()
p <- corr_df %>%
ggplot() +
ggtitle("National Vaccination Rates by Collectivist/Individualist Index") +
geom_point(aes(x = culture_index_value, y = pop_vax_rate, color = country, text = paste("country:", country))) +
stat_smooth(aes(x = culture_index_value, y = pop_vax_rate), method = "lm", col = "red", se = FALSE)
fig <- ggplotly(p)
fig
This scatterplot shows that as countries become more individualistic, and less collectivist, they appear to have more success vaccinating their population. This relationship does not appear to be that strong, and we are unable to verify if it is statistically significant or not. For my final project, I hope to include a button in this graph that allows the user to rapidly switch between the 6 cultural indices.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Markowitz (2022, Feb. 23). Data Analytics and Computational Social Science: HW4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomamarkowitzhw4/
BibTeX citation
@misc{markowitz2022hw4, author = {Markowitz, Ari}, title = {Data Analytics and Computational Social Science: HW4}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomamarkowitzhw4/}, year = {2022} }