Third homework assisgnment for DACSS 601 - Identifying and cleaning up course project data.
The data I have chosen to work with for my final project is taken from the World Bank and contains a number statistics related to education, income distribution, and suicide amongst other things. The data is organized as annual measurements of each statistic for a range of countries including the US, China, Russia, and some of Europe’s largest economies. In this assignment I will be tidying the data to better understand what is available.
First, let’s take a look at the outline of the data.
rmarkdown::paged_table(world_bank_data)
As seen above, the dataset is formatted with years in the columns and each row being an observation of country and statistic/metric/variable. Before we make any transformations we will define the variables that are available so we can better understand what transformations would make sense. We know we have country and year so let’s look at the series column to see what else is there.
table(world_bank_data$`Series Name`)
Crop production index (2014-2016 = 100)
10
Educational attainment, at least Bachelor's or equivalent, population 25+, total (%) (cumulative)
10
Educational attainment, at least completed primary, population 25+ years, total (%) (cumulative)
10
Educational attainment, at least completed short-cycle tertiary, population 25+, total (%) (cumulative)
10
Educational attainment, at least completed upper secondary, population 25+, total (%) (cumulative)
10
Educational attainment, at least Master's or equivalent, population 25+, total (%) (cumulative)
10
Educational attainment, Doctoral or equivalent, population 25+, total (%) (cumulative)
10
GDP (constant 2015 US$)
10
GDP per capita (constant 2015 US$)
10
Gini index (World Bank estimate)
10
Population density (people per sq. km of land area)
10
Rural population growth (annual %)
10
Suicide mortality rate (per 100,000 population)
10
Urban population growth (annual %)
10
Wage and salaried workers, total (% of total employment) (modeled ILO estimate)
10
Alright, let’s define the available data. One caveat, due to the format of the data all of the variables are currently class “character”, but in the list below I will define their ideal class.
Perhaps a little overkill on defining the available variables, but now we have a very good idea of what we have to work with. In summary, we have variables identifying different countries, years between 1988 and 2020, and a number of population and economic statistics including education attainment, GDP, and Gini Index.
Before we decide what questions we want to ask (and can ask of the data) we will clean up what we have.
long_pivot_world_bank_data <- world_bank_data %>%
select(-`Country Name`,-`Series Name`) %>%
pivot_longer("1988 [YR1988]":"2020 [YR2020]",
names_to = "Year",
names_transform = list(Year=parse_number),
values_to = "Value") %>%
filter(Value != "..")
rmarkdown::paged_table(long_pivot_world_bank_data)
We can see that we did, indeed, make the table longer. Where our latest table has nearly 3000 observations and the pre-pivot table was only 155 observations.
We want our rows to be Country, Year, and the statistics for that country and year. What we have now that disrupts this is variables saved in the Series Code column. To solve this problem we will again pivot, but this time wider, to create columns for each of our varibles.
wide_pivot_world_bank_data <- long_pivot_world_bank_data %>%
pivot_wider(names_from=c(`Series Code`),
values_from = Value) %>%
arrange(`Country Code`,Year)
rmarkdown::paged_table(wide_pivot_world_bank_data)
Lastly, before we truly dig in and explore what is available, we will make sure the column titles are clear and that the variables are in the correct format. Due to the double point/perdiod notation of NA values, all of our variables are class character so we’ll need to update them to be more appropriate classes.
#Update column names
colnames(wide_pivot_world_bank_data) <- c("Country_Code","Year","Gini_Index","Edu_Bachelor","Edu_Primary",
"Edu_Short_Tertiary","Edu_Seconday","Edu_Master","Edu_Doctoral",
"Crop_Prod_Index","Population_Density","Urban_Growth","Rural_Growth",
"Suicide_Rate_100K","Wage_Workers","GDP","GDP_Per_Cap")
names(wide_pivot_world_bank_data)
[1] "Country_Code" "Year" "Gini_Index"
[4] "Edu_Bachelor" "Edu_Primary" "Edu_Short_Tertiary"
[7] "Edu_Seconday" "Edu_Master" "Edu_Doctoral"
[10] "Crop_Prod_Index" "Population_Density" "Urban_Growth"
[13] "Rural_Growth" "Suicide_Rate_100K" "Wage_Workers"
[16] "GDP" "GDP_Per_Cap"
# Check which columns need to be updated
str(wide_pivot_world_bank_data)
tibble [330 x 17] (S3: tbl_df/tbl/data.frame)
$ Country_Code : chr [1:330] "AUS" "AUS" "AUS" "AUS" ...
$ Year : num [1:330] 1988 1989 1990 1991 1992 ...
$ Gini_Index : chr [1:330] NA "33.2" NA NA ...
$ Edu_Bachelor : chr [1:330] NA NA NA NA ...
$ Edu_Primary : chr [1:330] NA NA NA NA ...
$ Edu_Short_Tertiary: chr [1:330] NA NA NA NA ...
$ Edu_Seconday : chr [1:330] NA NA NA NA ...
$ Edu_Master : chr [1:330] NA NA NA NA ...
$ Edu_Doctoral : chr [1:330] NA NA NA NA ...
$ Crop_Prod_Index : chr [1:330] "56.48" "57.17" "57.43" "51.89" ...
$ Population_Density: chr [1:330] "2.1519857334392" "2.18871952410086" "2.22135298022728" "2.24984705101337" ...
$ Urban_Growth : chr [1:330] "1.60229012217722" "1.65747051532311" "1.44486746340481" "1.239458624037" ...
$ Rural_Growth : chr [1:330] "1.83631840572498" "1.89914714084439" "1.68614403054745" "1.48029642746829" ...
$ Suicide_Rate_100K : chr [1:330] NA NA NA NA ...
$ Wage_Workers : chr [1:330] NA NA NA "80" ...
$ GDP : chr [1:330] "586536446754.782" "609202798452.45" "630972702488.834" "628479810251.912" ...
$ GDP_Per_Cap : chr [1:330] "35478.4267523247" "36231.0161797299" "36974.4509255049" "36361.9422733113" ...
# Update said columns
wide_pivot_world_bank_data[,c(3:17)] <- apply(X = wide_pivot_world_bank_data[,c(3:17)], MARGIN=2, FUN=function(x) as.numeric(x))
wide_pivot_world_bank_data$Country_Code <- as.factor(wide_pivot_world_bank_data$Country_Code)
rmarkdown::paged_table(wide_pivot_world_bank_data)
Hooray! We did it! We now have a tidy and very interesting dataset.
Now that we have a Tidy dataset, let’s think about some questions that we could ask. My general approach to this research will be understanding if there are relationships between the available variables.
There are a number of other questions we could likely ask, but we will start very broad and try to work our way down to what questions should we be asking.
Thank you for taking time to review my thought process. I’m interested in the next steps and what insights we may get out of exploring this data.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Bartelloni (2022, March 2). Data Analytics and Computational Social Science: TB HW3 World Bank Data Cleaning. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtbartelloni872562/
BibTeX citation
@misc{bartelloni2022tb, author = {Bartelloni, Tory}, title = {Data Analytics and Computational Social Science: TB HW3 World Bank Data Cleaning}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtbartelloni872562/}, year = {2022} }