This is my fourth homework for DACSS 601.
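library(readxl)   # read_excel()
library(dplyr)    # %>%, group_by(), summarise(), filter()
library(ggplot2)  # plots

SPI <- read_excel("/Users/karenkimble/Documents/R Practice/Social Progress Index.xlsx", sheet = "2011-2021 data")
# Drop the two columns that read_excel auto-named ...10 and ...23
SPI$...10 <- NULL
SPI$...23 <- NULL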
colnames(SPI) <- c("Rank",
"Country",
"Code",
"Year",
"Status",
"SPI",
"Needs",
"Wellbeing",
"Opportunity",
"Nutrition/care",
"Sanitation",
"Shelter",
"Safety",
"Access-knowledge",
"Info-comm",
"Health",
"Environment",
"Rights",
"Choice",
"Inclusiveness",
"Advanced-ed",
"Infectious",
"Child mortality",
"Stunting",
"Maternal-mortality",
"Undernourishment",
"Improved-sanitation",
"Improved-water",
"Hygiene-deaths",
"Pollution-deaths",
"Housing",
"Electricity",
"Clean-fuels",
"Personal-violence-deaths",
"Transport",
"Criminality",
"Political-killings",
"Women-no-education",
"Education-access",
"Primary-enrollment",
"Secondary-attainment",
"Gender-gap-secondary",
"Online-governance",
"Internet-users",
"Media",
"Cellphone",
"Life-expectancy",
"Premature-deaths",
"Healthcare",
"Essential-services",
"Pollution",
"Lead",
"Particulate",
"Species",
"Justice",
"Expression",
"Religion",
"Political-rights",
"Property",
"Contraception",
"Corruption",
"Early-marriage",
"Youth-nonemployed",
"Vulnerable",
"Equal-gender",
"Equal-social",
"Equal-socioeconomic",
"Discrimination-violence",
"LGBT",
"Citable-docs",
"Academic",
"Women-advanced",
"Tertiary",
"Quality-unis")
Because of the sheer number of variables in this dataset, I will focus on only one of the SPI’s three major categories: Foundations of Wellbeing. The other two categories, Basic Needs and Opportunity, are still important and should be analyzed. However, I am primarily interested in Foundations of Wellbeing, which includes indicators related to access to knowledge and infrastructure as well as health, because it may be interesting to see whether countries generally viewed as more “free” and democratic (such as the United States or some European Union countries) do well in those categories. The category still condenses a lot of variables, so I will analyze the four main component scores, each of which is computed from its own sub-indicators: Access to Basic Knowledge, Access to Information and Communications, Health and Wellness, and Environmental Quality.
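As a rough sketch of narrowing the data to those four component scores, the snippet below uses `select()` from dplyr. The `spi_toy` tibble and every value in it are invented purely for illustration; only the column names follow the renaming above.

```r
library(dplyr)

# Hypothetical stand-in for SPI: the values are made up, the column names match the renamed data
spi_toy <- tibble(
  Country = c("A", "A", "B"),
  Year = c(2020, 2021, 2021),
  Wellbeing = c(70.1, 71.3, 55.0),
  `Access-knowledge` = c(80.2, 81.0, 60.5),
  `Info-comm` = c(65.0, 67.4, 40.2),
  Health = c(72.3, 72.9, 58.1),
  Environment = c(68.8, 69.5, 51.7),
  Rights = c(90.0, 90.5, 45.0)
)

# The four Foundations of Wellbeing component scores
wellbeing_vars <- c("Access-knowledge", "Info-comm", "Health", "Environment")

# Keep the identifying columns plus the four components; everything else is dropped
spi_wellbeing <- spi_toy %>%
  select(Country, Year, all_of(wellbeing_vars))
```

Working from a narrower table like this is optional; the analysis below simply refers to the columns directly on the full data.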
# Access to Knowledge
summary(SPI$`Access-knowledge`)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
15.27 59.09 77.52 73.24 91.05 99.51 369
sd(SPI$`Access-knowledge`, na.rm=TRUE)
[1] 20.56969
# If I wanted to look at each country's score as an average across all years
SPI %>%
  group_by(`Country`) %>%
  summarise(mean = mean(`Access-knowledge`))
# A tibble: 205 × 2
Country mean
<chr> <dbl>
1 Albania 89.8
2 Algeria 73.4
3 American Samoa NA
4 Andorra NA
5 Angola 47.8
6 Antigua and Barbuda NA
7 Argentina 84.6
8 Armenia 93.8
9 Australia 95.0
10 Austria 96.3
# … with 195 more rows
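The `NA` means in the output above come from how `mean()` treats missing values: it returns `NA` if any value in the group is missing unless `na.rm = TRUE` is set. The sketch below shows the difference on a small invented tibble (`toy` and its values are hypothetical). It does not establish whether the countries shown as `NA` in the real data are missing some years or all of them.

```r
library(dplyr)

# Hypothetical toy data: "B" is missing one year, "C" has no scores at all
toy <- tibble(
  Country = c("A", "A", "B", "B", "C", "C"),
  `Access-knowledge` = c(90, 92, 70, NA, NA, NA)
)

result <- toy %>%
  group_by(Country) %>%
  summarise(
    mean_strict = mean(`Access-knowledge`),                   # NA if any year is missing
    mean_available = mean(`Access-knowledge`, na.rm = TRUE),  # NaN if every year is missing
    n_missing = sum(is.na(`Access-knowledge`))                # how many years are missing
  )

result
```

Keeping a count of missing years next to the mean makes it easier to tell a country with one gap from a country with no data.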
# Access to Information & Communications
summary(SPI$`Info-comm`)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.28 40.17 58.16 57.13 75.66 98.91 358
sd(SPI$`Info-comm`, na.rm=TRUE)
[1] 22.60447
# Looking at the average information and communications score for each country
SPI %>%
  group_by(`Country`) %>%
  summarise(mean = mean(`Info-comm`))
# A tibble: 205 × 2
Country mean
<chr> <dbl>
1 Albania 66.8
2 Algeria 45.5
3 American Samoa NA
4 Andorra NA
5 Angola 29.8
6 Antigua and Barbuda NA
7 Argentina 71.7
8 Armenia 60.3
9 Australia 92.2
10 Austria 85.0
# … with 195 more rows
# Health & Wellness
summary(SPI$`Health`)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
15.19 47.28 61.22 60.73 72.52 92.10 347
sd(SPI$`Health`, na.rm=TRUE)
[1] 16.51259
# Looking at the average health and wellness score of all years for each country
SPI %>%
  group_by(`Country`) %>%
  summarise(mean = mean(`Health`))
# A tibble: 205 × 2
Country mean
<chr> <dbl>
1 Albania 72.4
2 Algeria 68.7
3 American Samoa NA
4 Andorra NA
5 Angola 41.5
6 Antigua and Barbuda NA
7 Argentina 68.2
8 Armenia 66.0
9 Australia 86.5
10 Austria 85.7
# … with 195 more rows
# Environmental Quality
summary(SPI$`Environment`)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
20.82 58.03 66.87 66.35 76.06 95.15 325
sd(SPI$`Environment`, na.rm=TRUE)
[1] 13.93559
# Looking at average score for each country
SPI %>%
  group_by(`Country`) %>%
  summarise(mean = mean(`Environment`))
# A tibble: 205 × 2
Country mean
<chr> <dbl>
1 Albania 71.0
2 Algeria 51.7
3 American Samoa NA
4 Andorra NA
5 Angola 62.9
6 Antigua and Barbuda NA
7 Argentina 78.0
8 Armenia 58.3
9 Australia 85.1
10 Austria 83.4
# … with 195 more rows
By looking at how the world is doing as a whole from 2011-2021, we can get an idea of what the improvement overall has been like and compare that to individual countries’ progress.
# Seeing if average Access to Knowledge scores are changing over time
averageSPIAK <- SPI %>%
  group_by(`Year`) %>%
  summarise(`Avg AK` = mean(`Access-knowledge`, na.rm = TRUE))
ggplot(data = averageSPIAK,
       mapping = aes(x = `Year`, y = `Avg AK`)) +
  geom_point(color = "dark blue")
# What about Access to Information and Communications?
averageSPIIC <- SPI %>%
  group_by(`Year`) %>%
  summarise(`Avg IC` = mean(`Info-comm`, na.rm = TRUE))
ggplot(data = averageSPIIC,
       mapping = aes(x = `Year`, y = `Avg IC`)) +
  geom_point(color = "dark red")
# Looking at Health and Wellness
averageSPIHW <- SPI %>%
  group_by(`Year`) %>%
  summarise(`Avg HW` = mean(`Health`, na.rm = TRUE))
ggplot(data = averageSPIHW,
       mapping = aes(x = `Year`, y = `Avg HW`)) +
  geom_point(color = "purple")
# Lastly, looking at Environmental Quality
averageSPIEQ <- SPI %>%
  group_by(`Year`) %>%
  summarise(`Avg EQ` = mean(`Environment`, na.rm = TRUE))
ggplot(data = averageSPIEQ,
       mapping = aes(x = `Year`, y = `Avg EQ`)) +
  geom_point(color = "dark green")
All of these plots show that there has been improvement across all categories, but not all of the trends have been consistent, and not all of them have been exponential. Something left out is how each individual country has improved over the years. I also could have chosen a different metric, such as the median, which can give a different type of insight since means may be skewed by outliers. Additionally, the dataset covers only a few years compared to the length of human history, so more historical data could be valuable.
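To illustrate the point about the median, here is a minimal sketch that computes both the mean and the median per year for two scores in a single `summarise()` call using `across()`. The `toy` tibble is invented for illustration, with one outlying value per year so the two metrics diverge; it is not the SPI data.

```r
library(dplyr)

# Hypothetical yearly scores; each year has one outlier (90/91 for Health, 20/25 for Environment)
toy <- tibble(
  Year = c(2011, 2011, 2011, 2012, 2012, 2012),
  Health = c(50, 55, 90, 52, 58, 91),
  Environment = c(60, 62, 20, 61, 64, 25)
)

# One row per year, with columns named <score>_mean and <score>_median
yearly <- toy %>%
  group_by(Year) %>%
  summarise(across(
    c(Health, Environment),
    list(
      mean = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE)
    )
  ))

yearly
```

In this toy example the high Health outlier pulls the mean above the median, while the low Environment outlier pulls the mean below it, which is the kind of skew the median is meant to guard against.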
Since there are a great many countries in the dataset and I don’t want an overcrowded graph, I will select a few countries to look at. I’ll base my selection on the most populous country in each continent so there is some similarity between them: China, Russia, the United States, Brazil, Nigeria, and Australia.
SPI_Large <- SPI %>%
  filter(`Country` %in% c("China", "Russia", "Brazil", "Nigeria", "Australia",
                          "United States"))
head(SPI_Large)
# A tibble: 6 × 74
Rank Country Code Year Status SPI Needs Wellbeing Opportunity
<dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 11 Australia AUS 2021 Ranked 90.3 95.1 90.4 85.3
2 10 Australia AUS 2020 Ranked 90.1 95.1 90.5 84.8
3 12 Australia AUS 2019 Ranked 90 95 90.2 84.8
4 11 Australia AUS 2018 Ranked 89.9 94.6 90.6 84.5
5 10 Australia AUS 2017 Ranked 90.0 95.1 90.2 84.6
6 10 Australia AUS 2016 Ranked 89.8 95.2 89.9 84.5
# … with 65 more variables: `Nutrition/care` <dbl>, Sanitation <dbl>,
# Shelter <dbl>, Safety <dbl>, `Access-knowledge` <dbl>,
# `Info-comm` <dbl>, Health <dbl>, Environment <dbl>, Rights <dbl>,
# Choice <dbl>, Inclusiveness <dbl>, `Advanced-ed` <dbl>,
# Infectious <dbl>, `Child mortality` <dbl>, Stunting <dbl>,
# `Maternal-mortality` <dbl>, Undernourishment <dbl>,
# `Improved-sanitation` <dbl>, `Improved-water` <dbl>, …
Looking at overall rankings over time gives a good general idea of how these countries have done in comparison to the others across all indicators, not just a few.
ggplot(data = SPI_Large, mapping = aes(x = `Year`, y = `Rank`, color = `Country`)) +
  geom_line() +
  facet_wrap(facets = vars(`Country`))
It is important to note that a lower rank number means the country is doing better than the others, and a higher number means it is doing worse.
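Because a lower rank number is better, one optional way to make a rank plot easier to read is to flip the y-axis with `scale_y_reverse()` from ggplot2, so that better ranks are drawn higher up. The sketch below uses an invented `toy_ranks` data frame (hypothetical countries and ranks), not the SPI data, and is not how the plot above was drawn.

```r
library(ggplot2)

# Hypothetical ranks for two made-up countries
toy_ranks <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  Year = rep(2019:2021, times = 2),
  Rank = c(10, 11, 12, 120, 118, 119)
)

p <- ggplot(toy_ranks, aes(x = Year, y = Rank, color = Country)) +
  geom_line() +
  scale_y_reverse() +          # rank 1 sits at the top of the axis
  facet_wrap(vars(Country))

p
```

With the axis reversed, a line that slopes downward means a country is slipping in the rankings, which matches the intuitive reading of "falling".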
From the above, we can see that Nigeria has consistently ranked very poorly, with very little improvement. Brazil had a slightly better-than-middle ranking, but was suddenly ranked worse in 2017 and has trended worse every year since. China and Russia, on the other hand, seem fairly stagnant, with consistent rankings throughout the years and China ranking worse than Russia. Australia has the best and most consistent rankings of these countries, while the US was a close second but has been ranked progressively worse since around 2015. I think it’s interesting to look at these comparisons of overall rankings because they make me wonder what is dragging down or boosting scores for each country. Questions left unanswered include how other countries in the same continent rank, what caused these rankings to drop, and which categories some countries do better in than others. A general view is helpful but does not tell everything.
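As a possible first step toward quantifying these movements, the change in rank between a country's first and last year could be computed per country. The sketch below does this on an invented `toy_ranks` tibble; the countries and ranks are hypothetical and are not taken from the SPI data.

```r
library(dplyr)

# Hypothetical ranks for two made-up countries across three years
toy_ranks <- tibble(
  Country = rep(c("A", "B"), each = 3),
  Year = rep(c(2011, 2016, 2021), times = 2),
  Rank = c(30, 35, 45, 100, 98, 97)
)

rank_change <- toy_ranks %>%
  group_by(Country) %>%
  arrange(Year, .by_group = TRUE) %>%    # make sure years are in order within each country
  summarise(
    first_rank = first(Rank),
    last_rank = last(Rank),
    change = last_rank - first_rank      # positive = slipped down the ranking
  )

rank_change
```

A table like this would only say how far a country moved, not why; explaining the movement would still require looking at the component scores.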
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Kimble (2022, April 27). Data Analytics and Computational Social Science: HW 4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomkkimble894180/
BibTeX citation
@misc{kimble2022hw,
  author = {Kimble, Karen},
  title = {Data Analytics and Computational Social Science: HW 4},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomkkimble894180/},
  year = {2022}
}