Data Analytics and Computational Social Science: TB HW4 Exploring and Visualizing Project Data

Tory Bartelloni

Summary of Previous

This is the follow-up to assignment three, which can be found here: https://rpubs.com/tbartelloni/872562

As a reminder, what we have done so far is…

Downloaded and read World Bank data containing annual economic, population, and education statistics, mostly for countries with large economies.
Pivoted the table to make a single Year column (each year was originally its own column).
Filtered out missing data (removed NA values).
Pivoted the table again to make a column for each variable/statistic (all variables were originally in one categorical column).
Updated the variable types and renamed the columns.
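The steps above can be sketched roughly as follows. This is a minimal illustration on a made-up table, not the actual code from assignment three: the names raw_wb, Series, and Value and all of the numbers are assumptions. It also shows how NA values can reappear after the second pivot when a country has one statistic for a year but not another.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the World Bank export: one row per country and series,
# one column per year. All names and values are made up for illustration.
raw_wb <- tibble(
  Country_Code = c("AAA", "AAA", "BBB", "BBB"),
  Series       = c("GDP", "Gini_Index", "GDP", "Gini_Index"),
  `2018`       = c(100, 30, 200, NA),
  `2019`       = c(110, 31, 210, 40)
)

tidy_wb <- raw_wb %>%
  # Year columns -> a single Year column, dropping missing values as we go
  pivot_longer(cols = c(`2018`, `2019`), names_to = "Year",
               values_to = "Value", values_drop_na = TRUE) %>%
  # One column per series/statistic; cells with no source row become NA
  pivot_wider(names_from = Series, values_from = Value) %>%
  # Update the variable types
  mutate(Year = as.integer(Year))

print(tidy_wb)
```

In this toy example BBB has no Gini value for 2018, so that cell is NA again in the wide table even though the missing row was dropped in the long one.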

This is what we have now.

rmarkdown::paged_table(final_wb_data)

Summarizing

First we will explore the data through summary statistics. We’ll use this time to better understand what data is available and what is not, and to make some comparisons between groups.

What Data is Available

First, I would like to know which variables are best to compare between countries, so I am going to count the non-NA observations per variable for each country.

observations <- final_wb_data %>% 
  group_by(Country_Code) %>%
  summarise(Total_Obs = n(),
            Gini_Obs = sum(!is.na(Gini_Index)),
            EduB_Obs = sum(!is.na(Edu_Bachelor)),
            EduP_Obs = sum(!is.na(Edu_Primary)),
            EduST_Obs = sum(!is.na(Edu_Short_Tertiary)),
            EduSC_Obs = sum(!is.na(Edu_Seconday)),
            EduM_Obs = sum(!is.na(Edu_Master)),
            EduD_Obs = sum(!is.na(Edu_Doctoral)),
            Crop_Obs = sum(!is.na(Crop_Prod_Index)),
            PopD_Obs = sum(!is.na(Population_Density)),
            UG_Obs = sum(!is.na(Urban_Growth)),
            RG_Obs = sum(!is.na(Rural_Growth)),
            WW_Obs = sum(!is.na(Wage_Workers)),
            SR_Obs = sum(!is.na(Suicide_Rate_100K)),
            GDP_Obs = sum(!is.na(GDP))
            )
rmarkdown::paged_table(observations)

Good to know. It looks like many of the variables have only a small number of observations in many or most countries. This is especially true of the Education variables, which would likely not be useful or could only be used to compare certain countries/years.

I am very interested to know a better way to do this, as I perform a similar operation later. Maybe I need to write my own function for this type of operation or explore the apply family more.
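One possible answer is dplyr's across(), which applies the same function to many columns inside summarise(). The sketch below uses a small made-up table rather than final_wb_data, and the generated column names (for example GDP_Obs and Gini_Index_Obs) differ from the hand-written abbreviations above, so treat it as a starting point rather than a drop-in replacement.

```r
library(dplyr)

# Made-up stand-in for final_wb_data with only a few columns
toy <- tibble(
  Country_Code = c("AAA", "AAA", "BBB", "BBB"),
  Year         = c(2018, 2019, 2018, 2019),
  GDP          = c(100, 110, 200, NA),
  Gini_Index   = c(30, NA, NA, NA)
)

observations_sketch <- toy %>%
  group_by(Country_Code) %>%
  summarise(
    # Count non-NA values in every column except Year
    # (across() skips the grouping column automatically)
    across(-Year, ~ sum(!is.na(.x)), .names = "{.col}_Obs"),
    # Computed last so that across() does not pick it up as another column
    Total_Obs = n()
  )

print(observations_sketch)
```

Putting n() after across() matters here: columns created earlier in the same summarise() call are visible to across(), so a Total_Obs defined first would be counted too.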

Explore GDP

In this section I will do some basic exploration of the GDP reported in the data, including comparing it by country and year.

# First we'll look at the overall center and spread of the data
mean(final_wb_data$GDP, na.rm=TRUE)

[1] 3586042934565

median(final_wb_data$GDP, na.rm=TRUE)

[1] 2109947351421

sd(final_wb_data$GDP, na.rm=TRUE)

[1] 4345680708676

# Next we'll compare years and include GDP per capita.
compared_GDP <- final_wb_data %>% 
  group_by(Year) %>%
  summarise(Average_GDP = mean(GDP, na.rm=TRUE),
            SD_GDP = sd(GDP, na.rm=TRUE),
            Mean_GDP_Per_Cap = mean(GDP_Per_Cap, na.rm=TRUE),
            SD_GDP_Per_Cap = sd(GDP_Per_Cap, na.rm=TRUE)) %>%
  arrange(Year)

rmarkdown::paged_table(compared_GDP)

Explore More Variables

Now we’ll expand on that idea and compare several of the variables, but this time comparing between countries.

gws_comp <- final_wb_data %>% group_by(Country_Code) %>%
  summarise(Count = n(),
            Average_Gini = mean(Gini_Index, na.rm=TRUE),
            Median_Gini = median(Gini_Index, na.rm=TRUE),
            SD_Gini = sd(Gini_Index, na.rm=TRUE),
            Average_Wage_Workers = mean(Wage_Workers, na.rm=TRUE),
            Median_Wage_Workers = median(Wage_Workers, na.rm=TRUE),
            SD_Wage_Workers = sd(Wage_Workers, na.rm=TRUE),
            Average_Suicide_Rate = mean(Suicide_Rate_100K, na.rm=TRUE),
            Median_Suicide_Rate = median(Suicide_Rate_100K, na.rm=TRUE),
            SD_Suicide_Rate = sd(Suicide_Rate_100K, na.rm=TRUE),
            )

rmarkdown::paged_table(gws_comp)

Again, interesting. But there has to be a better way.
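A possibly better way, sketched here on made-up data, is to pass a named list of functions to across() so that every selected column gets a mean, median, and standard deviation in one go. The toy table and the output naming pattern are assumptions chosen for illustration.

```r
library(dplyr)

# Made-up stand-in for final_wb_data
toy <- tibble(
  Country_Code      = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  Gini_Index        = c(30, 32, NA, 40, 42, 44),
  Suicide_Rate_100K = c(10, 12, 14, 5, NA, 7)
)

gws_sketch <- toy %>%
  group_by(Country_Code) %>%
  summarise(
    across(
      c(Gini_Index, Suicide_Rate_100K),
      # Each named function is applied to each selected column
      list(Average = ~ mean(.x, na.rm = TRUE),
           Median  = ~ median(.x, na.rm = TRUE),
           SD      = ~ sd(.x, na.rm = TRUE)),
      # Produces names such as Average_Gini_Index and SD_Suicide_Rate_100K
      .names = "{.fn}_{.col}"
    ),
    Count = n()
  )

print(gws_sketch)
```

Adding another variable such as Wage_Workers would then only mean adding its name to the column selection.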

Explore by Visualization

Univariate

First off, we’ll plot a histogram of the average Gini coefficient for each year.

final_wb_data %>%
  group_by(Year) %>%
  summarise(Count = n(),
            Average_Gini = mean(Gini_Index, na.rm=TRUE)) %>%
  ggplot(aes(x=Average_Gini)) +
  geom_histogram(binwidth=2)

Bivariate (Sort Of)

Next, let’s look at Gini Coefficient over time, by country.

final_wb_data %>%
  ggplot(aes(x=Year,y=Gini_Index)) +
  geom_point(aes(color=Country_Code),size=2) +
  geom_line(aes(color=Country_Code),size=1) +
  theme_bw()

I'm not entirely sure why the points are not continuously connected. I know the cause is that there are gaps in the data between years. Oh…maybe it's because my Year variable is a number and not a date? I'm not sure I’ve had this issue before.
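A likely explanation is that the numeric Year variable is not the problem. geom_line() breaks a line wherever a value in the middle of a series is NA, so a country with a missing Gini_Index for some year gets a gap there. Dropping those rows before plotting lets the line run straight between the neighboring points. The sketch below uses made-up data to show the idea.

```r
library(dplyr)
library(ggplot2)

# Made-up data with a missing Gini value in the middle of each series
toy <- tibble(
  Country_Code = rep(c("AAA", "BBB"), each = 4),
  Year         = rep(2016:2019, times = 2),
  Gini_Index   = c(30, NA, 32, 33, 40, 41, NA, 43)
)

gini_plot <- toy %>%
  # Remove rows with no Gini value so the lines are not broken at the NAs
  filter(!is.na(Gini_Index)) %>%
  ggplot(aes(x = Year, y = Gini_Index, color = Country_Code)) +
  geom_point(size = 2) +
  geom_line() +
  theme_bw()

print(gini_plot)
```

The trade-off is that a connected line can suggest data exist for the skipped years, so keeping the points visible helps show where the real observations are.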

Just For Fun

One last plot, mostly for fun, but also to test something and also to raise a question.

ggplotly(final_wb_data %>%
  ggplot(aes(x=Gini_Index,y=Suicide_Rate_100K)) +
  geom_point(aes(color=Country_Code,fill="black"),size=2,alpha=0.5))

So it looks like Plotly works just fine here! Which is great.

My question is about the fill command. Not necessarily what it did, but why the color does not match the command?
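A probable answer: anything inside aes() is treated as data to be mapped, so fill="black" creates a one-level variable whose value happens to be the string "black", and ggplot2 assigns it a color from its default palette instead of using black literally. On top of that, the default point shape has no fill, so fill only takes effect for shapes 21 to 25. A minimal sketch on made-up data, setting fill as a constant outside aes():

```r
library(ggplot2)

# Made-up stand-in for final_wb_data
toy <- data.frame(
  Country_Code      = c("AAA", "AAA", "BBB", "BBB"),
  Gini_Index        = c(30, 32, 40, 42),
  Suicide_Rate_100K = c(10, 12, 5, 7)
)

fill_plot <- ggplot(toy, aes(x = Gini_Index, y = Suicide_Rate_100K)) +
  # color is mapped to data inside aes(); fill is a fixed value outside it.
  # shape = 21 is a circle with both an outline (color) and an interior (fill).
  geom_point(aes(color = Country_Code),
             fill = "black", shape = 21, size = 2, alpha = 0.5)

print(fill_plot)
```

The resulting plot object can still be passed to ggplotly() in the same way as before.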

Closing

Thank you again for your time if you decided to review and read through my assignment. It was fun to do and, as always, has left me with many questions to continue to explore.
