Final Project

An analysis of world data records
Author

Tenzin Latoe

Published

July 16, 2023

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Introduction

For my final project, I will be analyzing a comprehensive data set that provides socio-economic and demographic information for 195 countries. The data set covers a wide range of information, including population characteristics, environmental aspects, economic factors, educational indicators, healthcare measures, and various other data points.

Data Description

The data set I used is from the website Kaggle and is titled “Global Country Information Dataset 2023”. There are 195 countries represented, and the information provided for each country is: Density (P/Km2), Abbreviation, Agricultural Land (%), Land Area (Km2), Armed Forces Size, Birth Rate, Calling Code, Capital/Major City, CO2 Emissions, CPI, CPI Change (%), Currency Code, Fertility Rate, Forested Area (%), Gasoline Price, GDP, Gross Primary Education Enrollment (%), Gross Tertiary Education Enrollment (%), Infant Mortality, Largest City, Life Expectancy, Maternal Mortality Ratio, Minimum Wage, Official Language, Out of Pocket Health Expenditure (%), Physicians per Thousand, Population, Labor Force Participation (%), Tax Revenue (%), Total Tax Rate, Unemployment Rate, Urban Population, Latitude, and Longitude.

world_data <- read_csv("_data/world-data-2023.csv")
world_data
print(summarytools::dfSummary(world_data,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

world_data

Dimensions: 195 x 35
Duplicates: 0
Variable | Stats / Values: Freqs (% of Valid) | Missing
Country [character] | 1. Afghanistan: 1 (0.5%); 2. Albania: 1 (0.5%); 3. Algeria: 1 (0.5%); 4. Andorra: 1 (0.5%); 5. Angola: 1 (0.5%); 6. Antigua and Barbuda: 1 (0.5%); 7. Argentina: 1 (0.5%); 8. Armenia: 1 (0.5%); 9. Australia: 1 (0.5%); 10. Austria: 1 (0.5%); [185 others]: 185 (94.9%) | 0 (0.0%)
Density (P/Km2) [numeric] | Mean (sd): 356.8 (1982.9); min ≤ med ≤ max: 2 ≤ 89 ≤ 26337; IQR (CV): 181 (5.6); 137 distinct values | 0 (0.0%)
Abbreviation [character] | 1. AD: 1 (0.5%); 2. AE: 1 (0.5%); 3. AF: 1 (0.5%); 4. AG: 1 (0.5%); 5. AL: 1 (0.5%); 6. AM: 1 (0.5%); 7. AO: 1 (0.5%); 8. AR: 1 (0.5%); 9. AT: 1 (0.5%); 10. AU: 1 (0.5%); [178 others]: 178 (94.7%) | 7 (3.6%)
Agricultural Land( %) [character] | 1. 17.40%: 3 (1.6%); 2. 2.70%: 2 (1.1%); 3. 23.10%: 2 (1.1%); 4. 23.30%: 2 (1.1%); 5. 25.60%: 2 (1.1%); 6. 26.30%: 2 (1.1%); 7. 28.70%: 2 (1.1%); 8. 31.10%: 2 (1.1%); 9. 32.40%: 2 (1.1%); 10. 33.30%: 2 (1.1%); [158 others]: 167 (88.8%) | 7 (3.6%)
Land Area(Km2) [numeric] | Mean (sd): 689624.4 (1921609); min ≤ med ≤ max: 0 ≤ 119511 ≤ 17098240; IQR (CV): 500427.8 (2.8); 194 distinct values | 1 (0.5%)
Armed Forces size [numeric] | Mean (sd): 159274.9 (380628.8); min ≤ med ≤ max: 0 ≤ 31000 ≤ 3031000; IQR (CV): 131000 (2.4); 105 distinct values | 24 (12.3%)
Birth Rate [numeric] | Mean (sd): 20.2 (9.9); min ≤ med ≤ max: 5.9 ≤ 18 ≤ 46.1; IQR (CV): 17.4 (0.5); 170 distinct values | 6 (3.1%)
Calling Code [numeric] | Mean (sd): 360.5 (323.2); min ≤ med ≤ max: 1 ≤ 255.5 ≤ 1876; IQR (CV): 424.2 (0.9); 182 distinct values | 1 (0.5%)
Capital/Major City [character] | 1. Abu Dhabi: 1 (0.5%); 2. Abuja: 1 (0.5%); 3. Accra: 1 (0.5%); 4. Addis Ababa: 1 (0.5%); 5. Algiers: 1 (0.5%); 6. Amman: 1 (0.5%); 7. Amsterdam: 1 (0.5%); 8. Andorra la Vella: 1 (0.5%); 9. Ankara: 1 (0.5%); 10. Antananarivo: 1 (0.5%); [182 others]: 182 (94.8%) | 3 (1.5%)
Co2-Emissions [numeric] | Mean (sd): 177799.2 (838790.3); min ≤ med ≤ max: 11 ≤ 12303 ≤ 9893038; IQR (CV): 61580 (4.7); 184 distinct values | 7 (3.6%)
CPI [numeric] | Mean (sd): 190.5 (397.9); min ≤ med ≤ max: 99 ≤ 125.3 ≤ 4583.7; IQR (CV): 43.4 (2.1); 175 distinct values | 17 (8.7%)
CPI Change (%) [character] | 1. 1.80%: 7 (3.9%); 2. 2.80%: 7 (3.9%); 3. 2.60%: 6 (3.4%); 4. 0.80%: 5 (2.8%); 5. 1.00%: 5 (2.8%); 6. 1.40%: 5 (2.8%); 7. 1.60%: 5 (2.8%); 8. 2.10%: 5 (2.8%); 9. 2.30%: 5 (2.8%); 10. 0.40%: 4 (2.2%); [76 others]: 125 (69.8%) | 16 (8.2%)
Currency-Code [character] | 1. EUR: 23 (12.8%); 2. XOF: 8 (4.4%); 3. USD: 6 (3.3%); 4. XCD: 6 (3.3%); 5. XAF: 5 (2.8%); 6. AUD: 4 (2.2%); 7. CHF: 2 (1.1%); 8. AED: 1 (0.6%); 9. AFN: 1 (0.6%); 10. ALL: 1 (0.6%); [123 others]: 123 (68.3%) | 15 (7.7%)
Fertility Rate [numeric] | Mean (sd): 2.7 (1.3); min ≤ med ≤ max: 1 ≤ 2.2 ≤ 6.9; IQR (CV): 1.9 (0.5); 139 distinct values | 7 (3.6%)
Forested Area (%) [character] | 1. 0.00%: 3 (1.6%); 2. 12.60%: 3 (1.6%); 3. 32.70%: 3 (1.6%); 4. 33.20%: 3 (1.6%); 5. 43.10%: 3 (1.6%); 6. 0.10%: 2 (1.1%); 7. 0.20%: 2 (1.1%); 8. 0.50%: 2 (1.1%); 9. 0.80%: 2 (1.1%); 10. 1.10%: 2 (1.1%); [151 others]: 163 (86.7%) | 7 (3.6%)
Gasoline Price [character] | 1. $0.71: 6 (3.4%); 2. $0.92: 5 (2.9%); 3. $1.16: 5 (2.9%); 4. $0.98: 4 (2.3%); 5. $1.12: 4 (2.3%); 6. $0.40: 3 (1.7%); 7. $0.80: 3 (1.7%); 8. $0.83: 3 (1.7%); 9. $0.90: 3 (1.7%); 10. $0.91: 3 (1.7%); [91 others]: 136 (77.7%) | 20 (10.3%)
GDP [character] | 1. $1,050,992,593: 1 (0.5%); 2. $1,119,190,780,753: 1 (0.5%); 3. $1,185,728,677: 1 (0.5%); 4. $1,228,170,370: 1 (0.5%); 5. $1,258,286,717,125: 1 (0.5%); 6. $1,340,389,411: 1 (0.5%); 7. $1,392,680,589,329: 1 (0.5%); 8. $1,394,116,310,769: 1 (0.5%); 9. $1,425,074,226: 1 (0.5%); 10. $1,637,931,034: 1 (0.5%); [183 others]: 183 (94.8%) | 2 (1.0%)
Gross primary education enrollment (%) [character] | 1. 100.90%: 4 (2.1%); 2. 104.00%: 4 (2.1%); 3. 100.00%: 3 (1.6%); 4. 100.20%: 3 (1.6%); 5. 100.30%: 3 (1.6%); 6. 100.40%: 3 (1.6%); 7. 101.90%: 3 (1.6%); 8. 103.20%: 3 (1.6%); 9. 106.20%: 3 (1.6%); 10. 106.40%: 3 (1.6%); [131 others]: 156 (83.0%) | 7 (3.6%)
Gross tertiary education enrollment (%) [character] | 1. 10.20%: 3 (1.6%); 2. 11.60%: 2 (1.1%); 3. 12.80%: 2 (1.1%); 4. 14.10%: 2 (1.1%); 5. 23.70%: 2 (1.1%); 6. 63.90%: 2 (1.1%); 7. 65.60%: 2 (1.1%); 8. 82.00%: 2 (1.1%); 9. 88.20%: 2 (1.1%); 10. 9.00%: 2 (1.1%); [161 others]: 162 (88.5%) | 12 (6.2%)
Infant mortality [numeric] | Mean (sd): 21.3 (19.5); min ≤ med ≤ max: 1.4 ≤ 14 ≤ 84.5; IQR (CV): 26.7 (0.9); 144 distinct values | 6 (3.1%)
Largest city [character] | 1. S����: 2 (1.1%); 2. Abidjan: 1 (0.5%); 3. Accra: 1 (0.5%); 4. Addis Ababa: 1 (0.5%); 5. Algiers: 1 (0.5%); 6. Almaty: 1 (0.5%); 7. Amman: 1 (0.5%); 8. Amsterdam: 1 (0.5%); 9. Andorra la Vella: 1 (0.5%); 10. Antananarivo: 1 (0.5%); [178 others]: 178 (94.2%) | 6 (3.1%)
Life expectancy [numeric] | Mean (sd): 72.3 (7.5); min ≤ med ≤ max: 52.8 ≤ 73.2 ≤ 85.4; IQR (CV): 10.5 (0.1); 134 distinct values | 8 (4.1%)
Maternal mortality ratio [numeric] | Mean (sd): 160.4 (233.5); min ≤ med ≤ max: 2 ≤ 53 ≤ 1150; IQR (CV): 173 (1.5); 114 distinct values | 14 (7.2%)
Minimum wage [character] | 1. $0.41: 3 (2.0%); 2. $2.00: 3 (2.0%); 3. $0.01: 2 (1.3%); 4. $0.05: 2 (1.3%); 5. $0.09: 2 (1.3%); 6. $0.23: 2 (1.3%); 7. $0.24: 2 (1.3%); 8. $0.25: 2 (1.3%); 9. $0.27: 2 (1.3%); 10. $0.29: 2 (1.3%); [104 others]: 128 (85.3%) | 45 (23.1%)
Official language [character] | 1. English: 31 (16.0%); 2. French: 25 (12.9%); 3. Spanish: 19 (9.8%); 4. Arabic: 18 (9.3%); 5. Portuguese: 7 (3.6%); 6. German: 4 (2.1%); 7. None: 4 (2.1%); 8. Russian: 4 (2.1%); 9. Swahili: 4 (2.1%); 10. Italian: 3 (1.5%); [67 others]: 75 (38.7%) | 1 (0.5%)
Out of pocket health expenditure [character] | 1. 15.20%: 3 (1.6%); 2. 36.70%: 3 (1.6%); 3. 40.50%: 3 (1.6%); 4. 10.20%: 2 (1.1%); 5. 12.50%: 2 (1.1%); 6. 14.80%: 2 (1.1%); 7. 16.90%: 2 (1.1%); 8. 17.60%: 2 (1.1%); 9. 18.30%: 2 (1.1%); 10. 19.60%: 2 (1.1%); [150 others]: 165 (87.8%) | 7 (3.6%)
Physicians per thousand [numeric] | Mean (sd): 1.8 (1.7); min ≤ med ≤ max: 0 ≤ 1.5 ≤ 8.4; IQR (CV): 2.6 (0.9); 152 distinct values | 7 (3.6%)
Population [numeric] | Mean (sd): 39381164 (145092392); min ≤ med ≤ max: 836 ≤ 8826588 ≤ 1397715000; IQR (CV): 26622812 (3.7); 194 distinct values | 1 (0.5%)
Population: Labor force participation (%) [character] | 1. 65.10%: 3 (1.7%); 2. 68.80%: 3 (1.7%); 3. 72.00%: 3 (1.7%); 4. 46.40%: 2 (1.1%); 5. 52.90%: 2 (1.1%); 6. 53.60%: 2 (1.1%); 7. 56.50%: 2 (1.1%); 8. 59.10%: 2 (1.1%); 9. 59.50%: 2 (1.1%); 10. 59.70%: 2 (1.1%); [135 others]: 153 (86.9%) | 19 (9.7%)
Tax revenue (%) [character] | 1. 19.50%: 4 (2.4%); 2. 10.20%: 3 (1.8%); 3. 13.60%: 3 (1.8%); 4. 14.20%: 3 (1.8%); 5. 18.60%: 3 (1.8%); 6. 20.10%: 3 (1.8%); 7. 23.00%: 3 (1.8%); 8. 0.00%: 2 (1.2%); 9. 10.10%: 2 (1.2%); 10. 10.80%: 2 (1.2%); [109 others]: 141 (83.4%) | 26 (13.3%)
Total tax rate [character] | 1. 36.60%: 4 (2.2%); 2. 49.70%: 3 (1.6%); 3. 22.20%: 2 (1.1%); 4. 30.10%: 2 (1.1%); 5. 30.60%: 2 (1.1%); 6. 31.60%: 2 (1.1%); 7. 32.60%: 2 (1.1%); 8. 33.20%: 2 (1.1%); 9. 36.20%: 2 (1.1%); 10. 37.00%: 2 (1.1%); [146 others]: 160 (87.4%) | 12 (6.2%)
Unemployment rate [character] | 1. 4.59%: 3 (1.7%); 2. 11.85%: 2 (1.1%); 3. 2.46%: 2 (1.1%); 4. 3.32%: 2 (1.1%); 5. 3.47%: 2 (1.1%); 6. 4.11%: 2 (1.1%); 7. 4.20%: 2 (1.1%); 8. 4.34%: 2 (1.1%); 9. 5.36%: 2 (1.1%); 10. 5.56%: 2 (1.1%); [154 others]: 155 (88.1%) | 19 (9.7%)
Urban_population [numeric] | Mean (sd): 22304543 (75430501); min ≤ med ≤ max: 5464 ≤ 4678104 ≤ 842933962; IQR (CV): 13750278 (3.4); 190 distinct values | 5 (2.6%)
Latitude [numeric] | Mean (sd): 19.1 (24); min ≤ med ≤ max: -40.9 ≤ 17.3 ≤ 65; IQR (CV): 35.6 (1.3); 194 distinct values | 1 (0.5%)
Longitude [numeric] | Mean (sd): 20.2 (66.7); min ≤ med ≤ max: -175.2 ≤ 21 ≤ 178.1; IQR (CV): 56.2 (3.3); 194 distinct values | 1 (0.5%)

Generated by summarytools 1.0.1 (R version 4.3.0)
2023-07-20

Analysis

The research questions I want to explore with this data set are the following:

  • Is there a correlation between primary education enrollment and the unemployment rate?

  • Does CO2 emission have an impact on life expectancy, fertility rate, and infant mortality rate?

This data set has several missing values. The code below shows that every column contains missing values except “Country” and “Density (P/Km2)”. Several rows also contain names that are garbled or incomplete. To tidy the data, I replaced the incomplete country, capital city, and largest city names with the complete and correct names.

# check for missing values
anyNA(world_data, recursive = TRUE)
[1] TRUE
#column names with missing values
names(which(colSums(is.na(world_data))>0))
 [1] "Abbreviation"                             
 [2] "Agricultural Land( %)"                    
 [3] "Land Area(Km2)"                           
 [4] "Armed Forces size"                        
 [5] "Birth Rate"                               
 [6] "Calling Code"                             
 [7] "Capital/Major City"                       
 [8] "Co2-Emissions"                            
 [9] "CPI"                                      
[10] "CPI Change (%)"                           
[11] "Currency-Code"                            
[12] "Fertility Rate"                           
[13] "Forested Area (%)"                        
[14] "Gasoline Price"                           
[15] "GDP"                                      
[16] "Gross primary education enrollment (%)"   
[17] "Gross tertiary education enrollment (%)"  
[18] "Infant mortality"                         
[19] "Largest city"                             
[20] "Life expectancy"                          
[21] "Maternal mortality ratio"                 
[22] "Minimum wage"                             
[23] "Official language"                        
[24] "Out of pocket health expenditure"         
[25] "Physicians per thousand"                  
[26] "Population"                               
[27] "Population: Labor force participation (%)"
[28] "Tax revenue (%)"                          
[29] "Total tax rate"                           
[30] "Unemployment rate"                        
[31] "Urban_population"                         
[32] "Latitude"                                 
[33] "Longitude"                                
# Define the rows and columns to be changed
rows_to_change_country <- c(151)
rows_to_change_capital <- c(24, 32, 38, 41, 77, 105, 113, 137, 151, 176, 177)
rows_to_change_largest_city <- c(24, 38, 41, 44, 77, 105, 113,  151, 169, 170, 176, 177)

# Define the new names for each column
new_names_country <- c("Sao Tome and Principe")
new_names_capital <- c("Brasília", "Yaoundé", "Bogotá", "San José", "Reykjavík","Malé", "Chişinău", "Asunción", "São Tomé", "Lomé", "Nuku'alofa")
# Row 24 is Brazil, whose largest city is São Paulo (Brasília is the capital)
new_names_largest_city <- c("São Paulo", "Bogotá", "San José", "Nicosia", "Reykjavík", "Malé", "Chişinău", "São Tomé", "Stockholm", "Zürich", "Lomé", "Nuku'alofa")

# Apply changes to the specified rows and columns
world_data$Country[rows_to_change_country] <- new_names_country
world_data$`Capital/Major City`[rows_to_change_capital] <- new_names_capital
world_data$`Largest city`[rows_to_change_largest_city] <- new_names_largest_city

world_data
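Hard-coded row numbers stop pointing at the right countries if the file is re-sorted or a row is added. One possible alternative is to match on the country name instead. The sketch below is only an illustration: `toy` stands in for `world_data`, the `Capital` column stands in for `Capital/Major City`, and `capital_fixes` is a hypothetical lookup table that is not part of the original project.

```r
# Toy stand-in for world_data; the garbled capitals are invented for illustration.
toy <- data.frame(
  Country = c("Brazil", "Togo", "France"),
  Capital = c("Bras?lia", "Lom?", "Paris"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per country whose capital needs repair.
capital_fixes <- data.frame(
  Country = c("Brazil", "Togo"),
  Fixed   = c("Brasília", "Lomé"),
  stringsAsFactors = FALSE
)

# match() gives the position of each toy country in the lookup (NA if absent).
idx <- match(toy$Country, capital_fixes$Country)

# Overwrite only the rows that have a fix; leave all other rows unchanged.
toy$Capital <- ifelse(is.na(idx), toy$Capital, capital_fixes$Fixed[idx])
toy
```

Because the replacement is keyed on `Country`, the same code keeps working when rows move, and a country missing from the lookup is simply left as it was.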

Visualization

To plot the following graphs, we need to convert the unemployment rate and gross primary education enrollment columns from “character” to “numeric”.

#convert from character to numeric
world_data$`Unemployment rate` <- gsub("%", "", world_data$`Unemployment rate`)
world_data$`Unemployment rate` <- as.numeric (world_data$`Unemployment rate`)
class(world_data$`Unemployment rate`)
[1] "numeric"
world_data$`Gross primary education enrollment (%)` <- gsub("%", "", world_data$`Gross primary education enrollment (%)`)
world_data$`Gross primary education enrollment (%)` <- as.numeric (world_data$`Gross primary education enrollment (%)`)
class(world_data$`Gross primary education enrollment (%)`)
[1] "numeric"
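Many other columns in the summary are also stored as text with "%", "$", or thousands separators. The same conversion can be wrapped in a small helper and applied to several columns at once. This is a minimal sketch: `to_number` is a hypothetical helper name, and the toy data frame stands in for `world_data`.

```r
# Strip "%", "$" and thousands separators, then convert to numeric.
to_number <- function(x) {
  as.numeric(gsub("[%$,]", "", x))
}

# Toy values in the same formats that appear in the data summary.
to_number(c("4.59%", "$1,050,992,593", "$0.71", NA))

# Apply the helper to several columns at once (toy stand-in for world_data).
toy <- data.frame(
  unemployment = c("4.59%", "11.85%"),
  gdp          = c("$1,050,992,593", "$1,185,728,677"),
  stringsAsFactors = FALSE
)
cols <- c("unemployment", "gdp")
toy[cols] <- lapply(toy[cols], to_number)
str(toy)
```

Missing values stay `NA` after conversion. `readr::parse_number()`, which ships with the tidyverse, is another option for the same job.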
# Correlation between Education enrollment and unemployment rate
ggplot(world_data, aes(x = `Gross primary education enrollment (%)`, y = `Unemployment rate`)) +
  geom_point()+
  labs(x = "Education Enrollment Rate", y = "Unemployment Rate") +
  theme_minimal()+
  ggtitle("Correlation between Education Enrollment Rate and Unemployment Rate")

The scatter plot suggests a negative correlation between primary education enrollment and unemployment: as education enrollment increases, the unemployment rate tends to decrease, and vice versa. The points are also fairly tightly clustered, which suggests a stronger correlation. Although correlation does not imply causation, the negative correlation suggests that countries with higher education enrollment rates are also more likely to have lower unemployment rates.
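A visual impression of correlation can be backed up with a number. The sketch below computes a Pearson correlation coefficient and its confidence interval. The two vectors are invented toy values standing in for the converted columns, so the result says nothing about the real data.

```r
# Toy vectors standing in for the two converted columns; values are invented.
enrollment   <- c(95.2, 100.9, 104.0, 88.7, 110.3, 99.1, NA, 102.5)
unemployment <- c(7.1, 4.6, 5.4, 11.9, 3.3, NA, 6.0, 4.2)

# Pearson correlation, using only pairs where both values are present.
r <- cor(enrollment, unemployment, use = "complete.obs")

# cor.test() adds a confidence interval and p-value; it drops incomplete pairs itself.
result <- cor.test(enrollment, unemployment, method = "pearson")
r
result$conf.int
```

With the real data, the two arguments would be `world_data$`Gross primary education enrollment (%)`` and `world_data$`Unemployment rate`` after the numeric conversion above.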

# CO2 emissions against three health and demographic indicators
ggplot(world_data, aes(x = `Co2-Emissions`)) +
  geom_line(aes(y = `Life expectancy`, color = "Life Expectancy")) +
  geom_line(aes(y = `Fertility Rate`, color = "Fertility Rate")) +
  geom_line(aes(y = `Infant mortality`, color = "Infant Mortality Rate")) +
  scale_color_manual(name = "Variables",
                     values = c("Life Expectancy" = "green",
                                "Fertility Rate" = "blue",
                                "Infant Mortality Rate" = "black")) +
  labs(y = "Rate") +
  theme_minimal()+
  ggtitle("Co2 Emissions versus Life Expectancy, Fertility Rate, and Infant Mortality Rate")

The line graph is difficult to read, but it appears that higher levels of CO2 emissions are associated with generally higher infant mortality, lower fertility rates, and lower life expectancy.
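Because emissions span several orders of magnitude (from 11 to 9,893,038 in the summary above), a rank-based measure may be easier to interpret than overlaid lines. The sketch below computes Spearman correlations between emissions and each indicator. The data frame and its column names are invented for illustration and say nothing about the real relationships.

```r
# Toy data standing in for world_data; all values are invented for illustration.
toy <- data.frame(
  co2              = c(11, 520, 12303, 61580, 415000, 9893038),
  life_expectancy  = c(70.1, 64.0, 72.5, 61.2, 80.3, 77.0),
  fertility_rate   = c(2.6, 4.4, 5.1, 2.1, 1.6, 1.7),
  infant_mortality = c(15.2, 41.3, 58.0, 9.8, 3.1, 7.4)
)

outcomes <- c("life_expectancy", "fertility_rate", "infant_mortality")

# Spearman's rank correlation depends only on the ordering of values,
# so the heavy right skew in emissions does not dominate the result.
rho <- sapply(outcomes, function(v) {
  cor(toy$co2, toy[[v]], method = "spearman", use = "complete.obs")
})
rho
```

The sign of each coefficient gives the direction of the association, which can then be compared against what the plot seems to show.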

top_10_most<- world_data %>%
  arrange(desc(Population)) %>%
  head(10)

top_10_least <- world_data %>%
  arrange(Population) %>%
  head(10)

ggplot(top_10_most, aes(x = Country, y = Population, fill=Country)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  labs(title = "Top 10 most populated countries",
       y = "Population",
       x = "Country") +
  theme_linedraw()+
  coord_flip(xlim = NULL, ylim = NULL, expand = FALSE, clip = "on") 

ggplot(top_10_least, aes(x = Country, y = Population, fill=Country)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  labs(title = "10 Least populated countries",
       y = "Population",
       x = "Country") +
  theme_linedraw()+
  coord_flip(xlim = NULL, ylim = NULL, expand = FALSE, clip = "on") 

The two graphs separate the most populated countries from the least populated ones. I initially tried to graph all the countries together; however, because of the large number of countries in the data set, I narrowed the data down to the 10 most populated and 10 least populated countries for clearer visualization. These graphs show that the most populated country in this data set is China, and the least populated country is Vatican City.
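By default, ggplot2 orders a character axis alphabetically, so the bars do not follow population size. One way to sort them is `reorder()`. This is a minimal sketch with an invented toy data frame standing in for `top_10_most`:

```r
# Toy data standing in for top_10_most; names and populations are invented.
toy <- data.frame(
  Country    = c("B-land", "A-land", "C-land"),
  Population = c(300, 1200, 50)
)

# reorder() turns Country into a factor whose levels are sorted by Population,
# so the bars follow population size instead of alphabetical order.
toy$Country <- reorder(toy$Country, toy$Population)
levels(toy$Country)

# Plot only if ggplot2 is available; geom_col() is shorthand for
# geom_bar(stat = "identity").
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = Country, y = Population)) +
    geom_col() +
    coord_flip()
  print(p)
}
```

After `coord_flip()`, the country with the largest population appears at the top of the chart.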

Future Analysis

Although my analysis was limited to a select set of variables, this data set provides many additional aspects that can be subject to further analysis. Geographical data using latitude and longitude can be used to create heat maps that reveal spatial patterns. Another analysis could use the tax data to compare tax rates in different countries and examine how they relate to economic development.

In conclusion, while my analysis was focused on only a few variables, there is potential for numerous other types of analysis due to the rich and diverse information provided in this data set.