This assignment explores, summarizes, and visualizes economic and population data from the World Bank.
This is the follow-up to assignment three, which can be found here: https://rpubs.com/tbartelloni/872562
As a reminder, what we have done so far is…
This is what we have now.
rmarkdown::paged_table(final_wb_data)
First, we will explore the data through summary statistics. We’ll use this time to better understand which data are available, which are missing, and how groups compare.
To start, I would like to know which variables are best to compare between countries, so I am going to count the non-NA observations per variable for each country.
observations <- final_wb_data %>%
group_by(Country_Code) %>%
summarise(Total_Obs = n(),
Gini_Obs = sum(!is.na(Gini_Index)),
EduB_Obs = sum(!is.na(Edu_Bachelor)),
EduP_Obs = sum(!is.na(Edu_Primary)),
EduST_Obs = sum(!is.na(Edu_Short_Tertiary)),
EduSC_Obs = sum(!is.na(Edu_Seconday)),
EduM_Obs = sum(!is.na(Edu_Master)),
EduD_Obs = sum(!is.na(Edu_Doctoral)),
Crop_Obs = sum(!is.na(Crop_Prod_Index)),
PopD_Obs = sum(!is.na(Population_Density)),
UG_Obs = sum(!is.na(Urban_Growth)),
RG_Obs = sum(!is.na(Rural_Growth)),
WW_Obs = sum(!is.na(Wage_Workers)),
SR_Obs = sum(!is.na(Suicide_Rate_100K)),
GDP_Obs = sum(!is.na(GDP))
)
rmarkdown::paged_table(observations)
Good to know. Many of the variables have only a small number of observations in many or most countries. This is especially true of the education variables, which would likely not be useful or could only be used to compare certain countries or years.
I am very interested to know a better way to do this, as I perform a similar operation later. Maybe I need to write my own function for this type of operation or explore the apply family more.
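One possible shortcut is dplyr's `across()`, which applies the same function to many columns in a single `summarise()` call. The sketch below is only an illustration: `toy_wb` is a made-up stand-in for `final_wb_data` with hypothetical values, and the generated column names (for example `Gini_Index_Obs`) differ from the hand-written ones above.

```r
library(dplyr)

# Hypothetical toy data standing in for final_wb_data (values are invented)
toy_wb <- tibble(
  Country_Code = c("USA", "USA", "USA", "FRA", "FRA"),
  Year         = c(2000, 2001, 2002, 2000, 2001),
  Gini_Index   = c(40.1, NA, 40.5, NA, 31.9),
  GDP          = c(1.0e13, 1.1e13, NA, 1.4e12, 1.5e12)
)

observations_toy <- toy_wb %>%
  group_by(Country_Code) %>%
  summarise(
    # Count non-NA values in every column except Year;
    # the grouping column is excluded automatically
    across(-Year, ~ sum(!is.na(.x)), .names = "{.col}_Obs"),
    # n() comes last so that across() does not pick up the new column
    Total_Obs = n()
  )

observations_toy
```

With this pattern, adding a variable to the data adds a count column without editing the `summarise()` call.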
In this section I will do some basic exploration of the GDP reported in the data, including comparing it by year.
# First we'll look at the overall center and spread of the data
mean(final_wb_data$GDP, na.rm=TRUE)
[1] 3586042934565
median(final_wb_data$GDP, na.rm=TRUE)
[1] 2109947351421
sd(final_wb_data$GDP, na.rm=TRUE)
[1] 4345680708676
# Next we'll compare years and include GDP per capita.
compared_GDP <- final_wb_data %>%
group_by(Year) %>%
summarise(Average_GDP = mean(GDP, na.rm=TRUE),
SD_GDP = sd(GDP, na.rm=TRUE),
Mean_GDP_Per_Cap = mean(GDP_Per_Cap, na.rm=TRUE),
SD_GDP_Per_Cap = sd(GDP_Per_Cap, na.rm=TRUE)) %>%
arrange(Year)
rmarkdown::paged_table(compared_GDP)
Now we’ll expand on that idea and compare several of the variables, but this time comparing between countries.
gws_comp <- final_wb_data %>% group_by(Country_Code) %>%
summarise(Count = n(),
Average_Gini = mean(Gini_Index, na.rm=TRUE),
Median_Gini = median(Gini_Index, na.rm=TRUE),
SD_Gini = sd(Gini_Index, na.rm=TRUE),
Average_Wage_Workers = mean(Wage_Workers, na.rm=TRUE),
Median_Wage_Workers = median(Wage_Workers, na.rm=TRUE),
SD_Wage_Workers = sd(Wage_Workers, na.rm=TRUE),
Average_Suicide_Rate = mean(Suicide_Rate_100K, na.rm=TRUE),
Median_Suicide_Rate = median(Suicide_Rate_100K, na.rm=TRUE),
SD_Suicide_Rate = sd(Suicide_Rate_100K, na.rm=TRUE)
)
rmarkdown::paged_table(gws_comp)
Again, interesting. But there has to be a better way.
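One candidate for that better way is to pass a named list of functions to dplyr's `across()`, so that each statistic is written once and applied to every chosen column. This is a sketch under assumptions: `toy_wb` is invented data standing in for `final_wb_data`, and the `.names` pattern yields columns such as `Average_Gini_Index` rather than the shorter hand-written names above.

```r
library(dplyr)

# Hypothetical toy data standing in for final_wb_data (values are invented)
toy_wb <- tibble(
  Country_Code      = c("USA", "USA", "USA", "FRA", "FRA"),
  Gini_Index        = c(40, NA, 42, 30, 32),
  Wage_Workers      = c(90, 92, NA, 88, 90),
  Suicide_Rate_100K = c(14, 15, 16, NA, 13)
)

gws_comp_toy <- toy_wb %>%
  group_by(Country_Code) %>%
  summarise(
    across(
      c(Gini_Index, Wage_Workers, Suicide_Rate_100K),
      # Each named function becomes one output column per input column
      list(
        Average = ~ mean(.x, na.rm = TRUE),
        Median  = ~ median(.x, na.rm = TRUE),
        SD      = ~ sd(.x, na.rm = TRUE)
      ),
      .names = "{.fn}_{.col}"
    ),
    # Row count per country, computed after across()
    Count = n()
  )

gws_comp_toy
```

Three columns and three functions give nine summary columns, plus `Country_Code` and `Count`.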
First off, we’ll compute the average Gini coefficient for each year and plot the distribution of those yearly averages as a histogram.
final_wb_data %>% group_by(Year) %>%
summarise(Count = n(),
Average_Gini = mean(Gini_Index, na.rm=TRUE)) %>%
ggplot(aes(x=Average_Gini)) +
geom_histogram(binwidth=2)
Next, let’s look at Gini Coefficient over time, by country.
final_wb_data %>%
ggplot(aes(x=Year,y=Gini_Index)) +
geom_point(aes(color=Country_Code),size=2) +
geom_line(aes(color=Country_Code),size=1) +
theme_bw()
Not entirely sure why the points are not continuously connected. I know the cause is that there is a gap in the data between years. Oh…maybe because my year variable is a number and not a date? Not sure I’ve had this issue before.
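A likely explanation, assuming the gaps correspond to years where `Gini_Index` is `NA`: `geom_line()` breaks a line wherever a missing value occurs in the middle of it, whether `Year` is a number or a date. Dropping the `NA` rows before plotting lets the line run directly between the observed years. The sketch below uses invented toy data (`toy_wb`) rather than `final_wb_data`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with missing Gini values in some years
toy_wb <- tibble(
  Country_Code = rep(c("USA", "FRA"), each = 5),
  Year         = rep(2000:2004, times = 2),
  Gini_Index   = c(40, NA, 41, NA, 42, 31, 32, NA, NA, 33)
)

gini_plot <- toy_wb %>%
  # Remove rows with no Gini value so the line is not interrupted
  filter(!is.na(Gini_Index)) %>%
  ggplot(aes(x = Year, y = Gini_Index, color = Country_Code)) +
  geom_point(size = 2) +
  geom_line() +
  theme_bw()

gini_plot
```

The trade-off is that the connected line hides where data are missing, so the points are worth keeping to show which years were actually observed.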
One last plot, mostly for fun, but also to test something and also to raise a question.
So looks like Plotly works just fine here! Which is great.
My question is about fill: not necessarily what it did, but why the color does not match what I specified.
Thank you again for your time if you decided to review and read through my assignment. It was fun to do and, as always, has left me with many questions to continue to explore.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Bartelloni (2022, March 6). Data Analytics and Computational Social Science: TB HW4 Exploring and Visualizing Project Data. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtbartelloni873045/
BibTeX citation
@misc{bartelloni2022tb,
  author = {Bartelloni, Tory},
  title = {Data Analytics and Computational Social Science: TB HW4 Exploring and Visualizing Project Data},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomtbartelloni873045/},
  year = {2022}
}