Data Analytics and Computational Social Science: TB HW3 World Bank Data Cleaning

Tory Bartelloni

Introduction

The data I have chosen to work with for my final project is taken from the World Bank and contains a number statistics related to education, income distribution, and suicide amongst other things. The data is organized as annual measurements of each statistic for a range of countries including the US, China, Russia, and some of Europe’s largest economies. In this assignment I will be tidying the data to better understand what is available.

First, let’s take a look at the outline of the data.

rmarkdown::paged_table(world_bank_data)

Defining the Variables

As seen above, the dataset is formatted with years in the columns and each row being an observation of country and statistic/metric/variable. Before we make any transformations we will define the variables that are available so we can better understand what transformations would make sense. We know we have country and year so let’s look at the series column to see what else is there.

table(world_bank_data$`Series Name`)


                                                                Crop production index (2014-2016 = 100) 
                                                                                                     10 
      Educational attainment, at least Bachelor's or equivalent, population 25+, total (%) (cumulative) 
                                                                                                     10 
       Educational attainment, at least completed primary, population 25+ years, total (%) (cumulative) 
                                                                                                     10 
Educational attainment, at least completed short-cycle tertiary, population 25+, total (%) (cumulative) 
                                                                                                     10 
     Educational attainment, at least completed upper secondary, population 25+, total (%) (cumulative) 
                                                                                                     10 
        Educational attainment, at least Master's or equivalent, population 25+, total (%) (cumulative) 
                                                                                                     10 
                 Educational attainment, Doctoral or equivalent, population 25+, total (%) (cumulative) 
                                                                                                     10 
                                                                                GDP (constant 2015 US$) 
                                                                                                     10 
                                                                     GDP per capita (constant 2015 US$) 
                                                                                                     10 
                                                                       Gini index (World Bank estimate) 
                                                                                                     10 
                                                    Population density (people per sq. km of land area) 
                                                                                                     10 
                                                                     Rural population growth (annual %) 
                                                                                                     10 
                                                        Suicide mortality rate (per 100,000 population) 
                                                                                                     10 
                                                                     Urban population growth (annual %) 
                                                                                                     10 
                        Wage and salaried workers, total (% of total employment) (modeled ILO estimate) 
                                                                                                     10

Alright, let’s define the available data. One caveat, due to the format of the data all of the variables are currently class “character”, but in the list below I will define their ideal class.

Country Name - factor - This is the name of the country that the observation applies to.
Country Code - factor - The shorthand abbreviation for the country name that the observation applies to.
Series Name - character - The long/full name of the statistic that is being observed.
Series Code - character - The shorthand abbreviation for the statistic that is being observed.
Crop production index (2014-2016 = 100) - numeric - This variable shows annual agricultural production compared to the base period (2014-2016).
Educational attainment, at least Bachelor’s or equivalent, population 25+, total (%) (cumulative) - numeric - The total portion of the population above age 25 that has completed an education of a Bachelor’s degree, equivalent, or higher.
Educational attainment, at least completed primary, population 25+ years, total (%) (cumulative) - numeric - The total portion of hte population above age 25 that has completed an education of a primary or higher (elementary school in the US).
Educational attainment, at least completed short-cycle tertiary, population 25+, total (%) (cumulative) - numeric - The total portion of the population above age 25 that has completed an education of short-cycle tertiary or higher (Associates degree in the US).
Educational attainment, at least completed upper secondary, population 25+, total (%) (cumulative) = numeric - The total portion of the population above age 25 that has completed an education of upper secondary or higher (High School in the US).
Educational attainment, at least Master’s or equivalent, population 25+, total (%) (cumulative) - numeric - The total portion of the population above age 25 that has completed an education of Master’s, equivalent, or higher.
Educational attainment, Doctoral or equivalent, population 25+, total (%) (cumulative) - numeric - The total portion of the population above the age 25 that has completed an education of Doctoral, equivalent, or higher.
GDP (constant 2015 US$) - numeric - The country Gross Domestic Product in 2015 US dollars.
GDP per capita (constant 2015 US$) - numeric - The country Gross Domestic product per capita in 2015 US dollars.
Gini index (World Bank estimate) - numeric - The World Bank’s Development Research Group’s estimate of country Gini Index, which is an estimat of the overall income distribution (or income inequality).
Population density (people per sq. km of land area) - numeric - The population density, calculated by mid-year population divided by land area in square kilometers.
Rural population growth (annual %) - numeric - The percentage growth of rural population, calculated as the difference between rural and urban population.
Suicide mortality rate (per 100,000 population) - integer - The number of suicide deaths in a year per 100,000 population. Crude suicide rate (not age-adjusted).
Urban population growth (annual %) - numeric - The percentage growth of urban population, calculated as the difference between rural and urban population.
Wage and salaried workers, total (% of total employment) (modeled ILO estimate) - numeric - The portion of the total workforce that hold a job defined as paid employment (i.e. pay is is not directly dependent upon the revenue of the unit for which they work).

Perhaps a little overkill on defining the available variables, but now we have a very good idea of what we have to work with. In summary, we have variables identifying different countries, years between 1988 and 2020, and a number of population and economic statistics including education attainment, GDP, and Gini Index.

Cleaning and Tidy-ing

Before we decide what questions we want to ask (and can ask of the data) we will clean up what we have.

First, let’s take only the columns that we need. I will be dropping the long form versions of country and series.
Then, we will pivot (longer) the year data to make year a new column.
Lastly, we will filter out rows that have no data. In this dataset those are indicated by a double point/period (“..”).

long_pivot_world_bank_data <- world_bank_data %>% 
  select(-`Country Name`,-`Series Name`) %>%
  pivot_longer("1988 [YR1988]":"2020 [YR2020]",
               names_to = "Year",
               names_transform = list(Year=parse_number),
               values_to = "Value") %>%
  filter(Value != "..")

rmarkdown::paged_table(long_pivot_world_bank_data)

We can see that we did, indeed, make the table longer. Where our latest table has nearly 3000 observations and the pre-pivot table was only 155 observations.

We want our rows to be Country, Year, and the statistics for that country and year. What we have now that disrupts this is variables saved in the Series Code column. To solve this problem we will again pivot, but this time wider, to create columns for each of our varibles.

wide_pivot_world_bank_data <- long_pivot_world_bank_data %>%
  pivot_wider(names_from=c(`Series Code`),
              values_from = Value) %>%
  arrange(`Country Code`,Year)

rmarkdown::paged_table(wide_pivot_world_bank_data)

Lastly, before we truly dig in and explore what is available, we will make sure the column titles are clear and that the variables are in the correct format. Due to the double point/perdiod notation of NA values, all of our variables are class character so we’ll need to update them to be more appropriate classes.

#Update column names
colnames(wide_pivot_world_bank_data) <- c("Country_Code","Year","Gini_Index","Edu_Bachelor","Edu_Primary",
                                          "Edu_Short_Tertiary","Edu_Seconday","Edu_Master","Edu_Doctoral",
                                          "Crop_Prod_Index","Population_Density","Urban_Growth","Rural_Growth",
                                          "Suicide_Rate_100K","Wage_Workers","GDP","GDP_Per_Cap")
names(wide_pivot_world_bank_data)

 [1] "Country_Code"       "Year"               "Gini_Index"        
 [4] "Edu_Bachelor"       "Edu_Primary"        "Edu_Short_Tertiary"
 [7] "Edu_Seconday"       "Edu_Master"         "Edu_Doctoral"      
[10] "Crop_Prod_Index"    "Population_Density" "Urban_Growth"      
[13] "Rural_Growth"       "Suicide_Rate_100K"  "Wage_Workers"      
[16] "GDP"                "GDP_Per_Cap"

# Check which columns need to be updated
str(wide_pivot_world_bank_data)

tibble [330 x 17] (S3: tbl_df/tbl/data.frame)
 $ Country_Code      : chr [1:330] "AUS" "AUS" "AUS" "AUS" ...
 $ Year              : num [1:330] 1988 1989 1990 1991 1992 ...
 $ Gini_Index        : chr [1:330] NA "33.2" NA NA ...
 $ Edu_Bachelor      : chr [1:330] NA NA NA NA ...
 $ Edu_Primary       : chr [1:330] NA NA NA NA ...
 $ Edu_Short_Tertiary: chr [1:330] NA NA NA NA ...
 $ Edu_Seconday      : chr [1:330] NA NA NA NA ...
 $ Edu_Master        : chr [1:330] NA NA NA NA ...
 $ Edu_Doctoral      : chr [1:330] NA NA NA NA ...
 $ Crop_Prod_Index   : chr [1:330] "56.48" "57.17" "57.43" "51.89" ...
 $ Population_Density: chr [1:330] "2.1519857334392" "2.18871952410086" "2.22135298022728" "2.24984705101337" ...
 $ Urban_Growth      : chr [1:330] "1.60229012217722" "1.65747051532311" "1.44486746340481" "1.239458624037" ...
 $ Rural_Growth      : chr [1:330] "1.83631840572498" "1.89914714084439" "1.68614403054745" "1.48029642746829" ...
 $ Suicide_Rate_100K : chr [1:330] NA NA NA NA ...
 $ Wage_Workers      : chr [1:330] NA NA NA "80" ...
 $ GDP               : chr [1:330] "586536446754.782" "609202798452.45" "630972702488.834" "628479810251.912" ...
 $ GDP_Per_Cap       : chr [1:330] "35478.4267523247" "36231.0161797299" "36974.4509255049" "36361.9422733113" ...

# Update said columns
wide_pivot_world_bank_data[,c(3:17)] <- apply(X = wide_pivot_world_bank_data[,c(3:17)], MARGIN=2, FUN=function(x) as.numeric(x))

wide_pivot_world_bank_data$Country_Code <- as.factor(wide_pivot_world_bank_data$Country_Code)

rmarkdown::paged_table(wide_pivot_world_bank_data)

Hooray! We did it! We now have a tidy and very interesting dataset.

Possible Questions

Now that we have a Tidy dataset, let’s think about some questions that we could ask. My general approach to this research will be understanding if there are relationships between the available variables.

Is population density a predictor or indication of growing income inequality?
Does larger income inequality lead to higher suicide rates?
Do higher levels of educational attainment impact income inequality or GDP?
Do different country’s GDPs have different relationships to population dynamics (i.e. rural vs. urban economies)?
Does a country’s GDP or education attainment levels influence the portion of wage workers in the economy?

There are a number of other questions we could likely ask, but we will start very broad and try to work our way down to what questions should we be asking.

Closing

Thank you for taking time to review my thought process. I’m interested in the next steps and what insights we may get out of exploring this data.

Comment on this article Share:

TB HW3 World Bank Data Cleaning

Introduction

Defining the Variables

Cleaning and Tidy-ing

Possible Questions

Closing

Reuse

Citation