Data Analytics and Computational Social Science: HW 4

Karen Kimble

Reading in the Dataset & Cleaning

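library(readxl)   # read_excel()
library(dplyr)    # %>%, group_by(), summarise(), filter()
library(ggplot2)  # ggplot() and geoms

SPI <- read_excel("/Users/karenkimble/Documents/R Practice/Social Progress Index.xlsx", sheet = "2011-2021 data")

# Drop the two unnamed columns (readxl auto-names columns with blank headers ...<position>)
SPI$...10 <- NULL
SPI$...23 <- NULL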

colnames(SPI) <- c("Rank",
                   "Country",
                   "Code",
                   "Year",
                   "Status",
                   "SPI",
                   "Needs",
                   "Wellbeing",
                   "Opportunity",
                   "Nutrition/care",
                   "Sanitation",
                   "Shelter",
                   "Safety",
                   "Access-knowledge",
                   "Info-comm",
                   "Health",
                   "Environment",
                   "Rights",
                   "Choice",
                   "Inclusiveness",
                   "Advanced-ed",
                   "Infectious",
                   "Child mortality",
                   "Stunting",
                   "Maternal-mortality",
                   "Undernourishment",
                   "Improved-sanitation",
                   "Improved-water",
                   "Hygiene-deaths",
                   "Pollution-deaths",
                   "Housing",
                   "Electricity",
                   "Clean-fuels",
                   "Personal-violence-deaths",
                   "Transport",
                   "Criminality",
                   "Political-killings",
                   "Women-no-education",
                   "Education-access",
                   "Primary-enrollment",
                   "Secondary-attainment",
                   "Gender-gap-secondary",
                   "Online-governance",
                   "Internet-users",
                   "Media",
                   "Cellphone",
                   "Life-expectancy",
                   "Premature-deaths",
                   "Healthcare",
                   "Essential-services",
                   "Pollution",
                   "Lead",
                   "Particulate",
                   "Species",
                   "Justice",
                   "Expression",
                   "Religion",
                   "Political-rights",
                   "Property",
                   "Contraception",
                   "Corruption",
                   "Early-marriage",
                   "Youth-nonemployed",
                   "Vulnerable",
                   "Equal-gender",
                   "Equal-social",
                   "Equal-socioeconomic",
                   "Discrimination-violence",
                   "LGBT",
                   "Citable-docs",
                   "Academic",
                   "Women-advanced",
                   "Tertiary",
                   "Quality-unis")

Summary Statistics

Because of the sheer number of variables in this dataset, I will focus on only one of the SPI’s three major categories: Foundations of Wellbeing. The other two categories, Basic Needs and Opportunity, are still important and should be analyzed. However, I am primarily interested in Foundations of Wellbeing, which includes indicators related to access to knowledge and infrastructure as well as health, because it may be interesting to see whether countries generally viewed as more “free” and democratic (such as the United States or some European Union countries) do well in those categories. Foundations of Wellbeing still condenses a lot of variables, so I will analyze its four component scores, each of which is computed from its own sub-indicators: Access to Basic Knowledge, Access to Information and Communications, Health and Wellness, and Environmental Quality.

# Access to Knowledge

summary(SPI$`Access-knowledge`)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  15.27   59.09   77.52   73.24   91.05   99.51     369

sd(SPI$`Access-knowledge`, na.rm=TRUE)

[1] 20.56969

# If I wanted to look at each country's score as an average of all its yearly scores
# (mean() returns NA for any country with a missing year because na.rm is not set)

SPI %>%
  group_by(`Country`) %>%
  summarise(mean = mean(`Access-knowledge`))

# A tibble: 205 × 2
   Country              mean
   <chr>               <dbl>
 1 Albania              89.8
 2 Algeria              73.4
 3 American Samoa       NA  
 4 Andorra              NA  
 5 Angola               47.8
 6 Antigua and Barbuda  NA  
 7 Argentina            84.6
 8 Armenia              93.8
 9 Australia            95.0
10 Austria              96.3
# … with 195 more rows
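
The NA rows above arise because mean() returns NA whenever any value in a group is missing. The following is a minimal sketch on made-up data, not the SPI file: the `toy` tibble, its values, and the `n_years` column are all assumptions for illustration. It shows how na.rm = TRUE, together with a count of non-missing years, separates countries with partial data from countries with no data at all.

```r
library(dplyr)

# Toy data standing in for SPI (values are made up for illustration)
toy <- tibble(
  Country = c("A", "A", "B", "B", "C", "C"),
  `Access-knowledge` = c(80, 90, 70, NA, NA, NA)
)

toy_means <- toy %>%
  group_by(Country) %>%
  summarise(
    mean_default = mean(`Access-knowledge`),                # NA if any year is missing
    mean_narm    = mean(`Access-knowledge`, na.rm = TRUE),  # ignores missing years
    n_years      = sum(!is.na(`Access-knowledge`))          # years behind each mean
  )

toy_means
```

Country C has no data at all, so its na.rm mean is NaN rather than NA, and n_years makes it clear how many observations support each average.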

# Access to Information & Communications

summary(SPI$`Info-comm`)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.28   40.17   58.16   57.13   75.66   98.91     358

sd(SPI$`Info-comm`, na.rm=TRUE)

[1] 22.60447

# Looking at the average information and communications score for each country

SPI %>%
  group_by(`Country`) %>%
  summarise(mean = mean(`Info-comm`))

# A tibble: 205 × 2
   Country              mean
   <chr>               <dbl>
 1 Albania              66.8
 2 Algeria              45.5
 3 American Samoa       NA  
 4 Andorra              NA  
 5 Angola               29.8
 6 Antigua and Barbuda  NA  
 7 Argentina            71.7
 8 Armenia              60.3
 9 Australia            92.2
10 Austria              85.0
# … with 195 more rows

# Health & Wellness

summary(SPI$`Health`)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  15.19   47.28   61.22   60.73   72.52   92.10     347

sd(SPI$`Health`, na.rm=TRUE)

[1] 16.51259

# Looking at the average health and wellness score of all years for each country

SPI %>%
  group_by(`Country`) %>%
  summarise(mean = mean(`Health`))

# A tibble: 205 × 2
   Country              mean
   <chr>               <dbl>
 1 Albania              72.4
 2 Algeria              68.7
 3 American Samoa       NA  
 4 Andorra              NA  
 5 Angola               41.5
 6 Antigua and Barbuda  NA  
 7 Argentina            68.2
 8 Armenia              66.0
 9 Australia            86.5
10 Austria              85.7
# … with 195 more rows

# Environmental Quality

summary(SPI$`Environment`)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  20.82   58.03   66.87   66.35   76.06   95.15     325

sd(SPI$`Environment`, na.rm=TRUE)

[1] 13.93559

# Looking at average score for each country

SPI %>%
  group_by(`Country`) %>%
  summarise(mean = mean(`Environment`))

# A tibble: 205 × 2
   Country              mean
   <chr>               <dbl>
 1 Albania              71.0
 2 Algeria              51.7
 3 American Samoa       NA  
 4 Andorra              NA  
 5 Angola               62.9
 6 Antigua and Barbuda  NA  
 7 Argentina            78.0
 8 Armenia              58.3
 9 Australia            85.1
10 Austria              83.4
# … with 195 more rows
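
The four blocks above repeat the same mean and standard deviation steps for each component. As a hedged alternative, the sketch below computes both statistics for several columns in one call with dplyr's across(). The `toy` tibble and its values are made up; on the real data one would select the four component columns instead of everything().

```r
library(dplyr)

# Toy data standing in for two of the SPI component columns (made-up values)
toy <- tibble(
  `Access-knowledge` = c(50, 70, NA),
  `Info-comm`        = c(40, 60, 80)
)

stats <- toy %>%
  summarise(across(
    everything(),
    list(
      mean = ~ mean(.x, na.rm = TRUE),  # average, skipping missing values
      sd   = ~ sd(.x, na.rm = TRUE)     # standard deviation, skipping missing values
    )
  ))

# Result columns are named <column>_<statistic>, e.g. `Info-comm_sd`
stats
```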

Visualizations

Average scores worldwide by year

By looking at how the world is doing as a whole from 2011-2021, we can get an idea of what the improvement overall has been like and compare that to individual countries’ progress.

# Seeing if average Access to Knowledge scores are changing over time
averageSPIAK <- SPI %>%
  group_by(`Year`) %>%
  summarise(`Avg AK` = mean(`Access-knowledge`, na.rm=TRUE))

ggplot(data = averageSPIAK,
       mapping=aes(x = `Year`, y = `Avg AK`)) +
  geom_point(color = "dark blue")

# What about Access to Information and Communications?

averageSPIIC <- SPI %>%
  group_by(`Year`) %>%
  summarise(`Avg IC` = mean(`Info-comm`, na.rm=TRUE))

ggplot(data = averageSPIIC,
       mapping = aes(x = `Year`, y = `Avg IC`)) +
  geom_point(color = "dark red")

# Looking at Health and Wellness

averageSPIHW <- SPI %>%
  group_by(`Year`) %>%
  summarise(`Avg HW` = mean(`Health`, na.rm=TRUE))

ggplot(data = averageSPIHW,
       mapping = aes(x = `Year`, y = `Avg HW`)) +
  geom_point(color = "purple")

# Lastly, looking at Environmental Quality

averageSPIEQ <- SPI %>%
  group_by(`Year`) %>%
  summarise(`Avg EQ` = mean(`Environment`, na.rm=TRUE))

ggplot(data = averageSPIEQ,
       mapping = aes(x = `Year`, y = `Avg EQ`)) +
  geom_point(color = "dark green")

All of these plots show improvement across all four categories, although not all of the increases have been consistent, and the trends all look roughly exponential. Something left out is how each individual country has improved over the years. I also could have chosen a different metric, such as the median, which can give a different type of insight since means may be skewed by outliers. Additionally, the dataset covers only eleven years (2011-2021), so more historical data could be valuable.
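
To illustrate the point about means and medians, here is a small sketch on made-up data (the `toy` tibble and the `Median AK` column name are assumptions, not part of the analysis above). It computes both statistics per year so they can be compared side by side.

```r
library(dplyr)

# Toy data standing in for SPI (made-up values; 2012 contains one low outlier)
toy <- tibble(
  Year = c(2011, 2011, 2011, 2012, 2012, 2012),
  `Access-knowledge` = c(60, 70, 80, 62, 72, 10)
)

yearly <- toy %>%
  group_by(Year) %>%
  summarise(
    `Avg AK`    = mean(`Access-knowledge`, na.rm = TRUE),
    `Median AK` = median(`Access-knowledge`, na.rm = TRUE)
  )

yearly
```

In the toy 2012 row the single outlier pulls the mean well below the median, which is the kind of gap worth checking for in the real yearly averages.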

Variation by Country

Since there are a great many countries in the dataset and I don’t want an overcrowded graph, I will select a few countries to look at. I’ll base my selection on the largest country by population on each continent so there is some similarity between them: China, Russia, the United States, Brazil, Nigeria, and Australia.

SPI_Large <- SPI %>%
  filter(`Country` %in% c("China", "Russia", "Brazil", "Nigeria", "Australia",
                          "United States"))

head(SPI_Large)

# A tibble: 6 × 74
   Rank Country   Code   Year Status   SPI Needs Wellbeing Opportunity
  <dbl> <chr>     <chr> <dbl> <chr>  <dbl> <dbl>     <dbl>       <dbl>
1    11 Australia AUS    2021 Ranked  90.3  95.1      90.4        85.3
2    10 Australia AUS    2020 Ranked  90.1  95.1      90.5        84.8
3    12 Australia AUS    2019 Ranked  90    95        90.2        84.8
4    11 Australia AUS    2018 Ranked  89.9  94.6      90.6        84.5
5    10 Australia AUS    2017 Ranked  90.0  95.1      90.2        84.6
6    10 Australia AUS    2016 Ranked  89.8  95.2      89.9        84.5
# … with 65 more variables: `Nutrition/care` <dbl>, Sanitation <dbl>,
#   Shelter <dbl>, Safety <dbl>, `Access-knowledge` <dbl>,
#   `Info-comm` <dbl>, Health <dbl>, Environment <dbl>, Rights <dbl>,
#   Choice <dbl>, Inclusiveness <dbl>, `Advanced-ed` <dbl>,
#   Infectious <dbl>, `Child mortality` <dbl>, Stunting <dbl>,
#   `Maternal-mortality` <dbl>, Undernourishment <dbl>,
#   `Improved-sanitation` <dbl>, `Improved-water` <dbl>, …
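
Because %in% silently drops any name that does not match exactly, it can be worth confirming that every requested country was found. The sketch below uses made-up data: the `toy` tibble and the "Russian Federation" spelling are hypothetical and only demonstrate a mismatch, not how the SPI file actually spells its country names.

```r
library(dplyr)

# Toy data standing in for SPI; "Russian Federation" is a hypothetical spelling
toy <- tibble(
  Country = c("Australia", "Brazil", "Russian Federation"),
  Rank    = c(10, 60, 62)
)

wanted <- c("Australia", "Brazil", "Russia")

toy_large <- toy %>%
  filter(Country %in% wanted)

# Names that were requested but are absent from the filtered data
missing_names <- setdiff(wanted, unique(toy_large$Country))
missing_names
```

An empty result from setdiff() means all requested countries are present; anything else points to a spelling difference to fix in the filter.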

Looking at overall rankings over time gives a good general idea of how these countries have done in comparison to the others across all indicators, not just a few.

ggplot(data = SPI_Large, mapping=aes(x = `Year`, y = `Rank`, color = `Country`)) +
  geom_line() +
  facet_wrap(facets = vars(`Country`))

(It is important to note that a lower rank number means the country is doing better than the others, and a higher number means it is doing worse.)
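
One way to make a rank plot easier to read is to flip the y-axis so that better ranks appear higher on the page. This is a sketch on made-up data, assuming ggplot2 is installed; the `toy` data frame and its ranks are invented, and the same two scale layers could be added to the SPI_Large plot.

```r
library(ggplot2)

# Toy data standing in for SPI_Large (made-up ranks)
toy <- data.frame(
  Year    = rep(2011:2013, times = 2),
  Rank    = c(10, 11, 10, 60, 62, 65),
  Country = rep(c("A", "B"), each = 3)
)

p <- ggplot(toy, aes(x = Year, y = Rank, color = Country)) +
  geom_line() +
  scale_y_reverse() +                       # put rank 1 at the top of the axis
  scale_x_continuous(breaks = 2011:2013) +  # whole-year tick marks
  facet_wrap(vars(Country))

p
```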

From the above, we can see that Nigeria has consistently ranked very poorly with very little improvement. Brazil had a slightly better-than-middle ranking, but then was suddenly ranked worse in 2017 and has continued to trend worse every year since. China and Russia, on the other hand, seem fairly stagnant, with consistent rankings throughout the years and China doing worse than Russia. Australia has the best consistent rankings of all the countries, while the US was a close second but has been ranked progressively worse since about 2015. I think it’s interesting to look at these comparisons when thinking about overall rankings because it makes me wonder what is dragging down or boosting scores for each country. Questions left unanswered include how other countries on the same continent rank, what caused these rankings to drop, and which categories some countries do better in than others. A general view is helpful but does not tell everything.
