library(tidyverse)
library(ggplot2)
library(summarytools)
library(plotly)
library(stringr)
library(readxl)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 5
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- create at least two univariate visualizations
- try to make them “publication” ready
- Explain why you choose the specific graph type
- Create at least one bivariate visualization
- try to make them “publication” ready
- Explain why you choose the specific graph type
R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.
Read in data
- USA Households ⭐⭐⭐⭐⭐
#! label: Data loading
#| warning: false
<- read_excel("../posts/_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx",skip = 5, n_max = 353, col_names = c( "Year", "Number","Total","pd_<15000","pd_15000-24999","pd_25000-34999","pd_35000-49999","pd_50000-74999","pd_75000-99999","pd_100000-149999","pd_150000-199999","pd_>200000","median_income_estimate","median_income_moe","mean_income_estimate","mean_income_moe"))
US_household_data
head(US_household_data,5)
# A tibble: 5 × 16
Year Number Total pd_<1…¹ pd_15…² pd_25…³ pd_35…⁴ pd_50…⁵ pd_75…⁶ pd_10…⁷
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ALL RACES <NA> NA NA NA NA NA NA NA NA
2 2019 128451 100 9.1 8 8.3 11.7 16.5 12.3 15.5
3 2018 128579 100 10.1 8.8 8.7 12 17 12.5 15
4 2017 2 127669 100 10 9.1 9.2 12 16.4 12.4 14.7
5 2017 127586 100 10.1 9.1 9.2 11.9 16.3 12.6 14.8
# … with 6 more variables: `pd_150000-199999` <dbl>, `pd_>200000` <dbl>,
# median_income_estimate <dbl>, median_income_moe <dbl>,
# mean_income_estimate <chr>, mean_income_moe <chr>, and abbreviated variable
# names ¹`pd_<15000`, ²`pd_15000-24999`, ³`pd_25000-34999`,
# ⁴`pd_35000-49999`, ⁵`pd_50000-74999`, ⁶`pd_75000-99999`,
# ⁷`pd_100000-149999`
Briefly describe the data
I have chosen USA households data which provides the income statistics of the US residents across different races over the years. The features of the US Household data include the percent distribution of various income range across multiple US Households. I have chosen this data because it portrays a very important and insightful economical trend for households over a time period.
##Data Tidy process
The US household data is anything but tidy. Here are the following few operations that have been performed to make it tidy: 1. Separating Race and Year from one column 2. Mutating dataframe to include race column 3. Removing trailing character from both race and column 4. Converting Number and Year Numerical columns from character to number. 5. pivoting percent distribution into 2 columns[income_Range and percent_distribution] from 9 different income range feature of data .
#! label: Data processing
#| warning: false
<- US_household_data%>%
US_processed_data rowwise()%>% #to ensure the following operation runs row wise
mutate(Race=case_when(
is.na(Number) ~ Year
%>%
))ungroup()%>% # to stop rowwise operation
fill(Race,.direction = "down")%>%
subset(!is.na(Number))%>%
rowwise()%>%
mutate(
Year=strsplit(Year,' ')[[1]][1],
Race=ifelse(grepl("[0-9]", Race ,perl=TRUE)[1],strsplit(Race," \\s*(?=[^ ]+$)",perl=TRUE)[[1]][1],Race),
mean_income_estimate=as.numeric(mean_income_estimate),
Number=as.numeric(Number),
Year=as.numeric(Year)
%>%
)pivot_longer(
cols = starts_with("pd"),
names_to = "income_range",
values_to = "percent_distribution",
names_prefix="pd_"
)view(dfSummary(US_processed_data))
Document your work here.
One of the major problems in US Household data is to efficiently analyse the races. This is partially because the most of the races are not singular i.e. people may identify themselves as both White and Hispanic. To handle such situations, I had to bundle them into singular races as per the response data.
<-US_processed_data%>%
US_mean_income_dataselect(Year,mean_income_estimate,Race)%>%
group_by(Race,Year)%>%
summarize(race_mean_income_estimate=mean(mean_income_estimate))
<- US_mean_income_data%>%
grouped_racemutate(CombinedRace=case_when(
str_detect(Race,"ASIAN")~"ASIAN",
str_detect(Race,"BLACK")~"BLACK",
str_detect(Race,"WHITE")~"WHITE",
TRUE ~ Race
))head(grouped_race,5)
# A tibble: 5 × 4
# Groups: Race [1]
Race Year race_mean_income_estimate CombinedRace
<chr> <dbl> <dbl> <chr>
1 ALL RACES 1967 53616 ALL RACES
2 ALL RACES 1968 56572 ALL RACES
3 ALL RACES 1969 59004 ALL RACES
4 ALL RACES 1970 58926 ALL RACES
5 ALL RACES 1971 58609 ALL RACES
Univariate Visualizations
<- ggplot(grouped_race%>%filter(CombinedRace=="ALL RACES"),aes(x=Year,y=race_mean_income_estimate,alpha=0.4,fill=Race)) +
race_income_area geom_area() +
labs(title="All Races Income change across year",
x="Year", y="Mean Income Estimate ($)")
ggplotly(race_income_area)
The above graph shows the change in mean income for all races combined over the years. I have plotted an area chart to efficiently visualize the change in income over the years. This chart is very easy to read and provides the necessary visual information instantly. In comparison to line chart choosing area chart, would help readers determine the change more efficiently.
<-US_processed_data%>%
All_race_income_dataselect(Year,income_range,percent_distribution,Race)%>%
filter(Race=="ALL RACES",Year==2019)
head(All_race_income_data)
# A tibble: 6 × 4
Year income_range percent_distribution Race
<dbl> <chr> <dbl> <chr>
1 2019 <15000 9.1 ALL RACES
2 2019 15000-24999 8 ALL RACES
3 2019 25000-34999 8.3 ALL RACES
4 2019 35000-49999 11.7 ALL RACES
5 2019 50000-74999 16.5 ALL RACES
6 2019 75000-99999 12.3 ALL RACES
<- All_race_income_data%>%
All_race_income_data arrange(desc(income_range)) %>%
mutate(prop = percent_distribution / sum(All_race_income_data$percent_distribution) *100) %>%
mutate(ypos = cumsum(prop)- 0.5*prop )
<- ggplot(All_race_income_data,aes(x="",y=percent_distribution,fill=income_range)) +
race_income_distribution geom_bar(stat="identity", width=1, color="white")+
coord_polar("y",start=0)+
theme_void()+
theme(legend.position="none") +
geom_text(aes(y = ypos, label = income_range), color = "black", size=2) +
scale_fill_brewer(palette="Set1")+
labs(title="Income Distribution($) for 'ALL Races' in 2019")
race_income_distribution
One of the many challenges I faced to bring univariate plot was to boil down to one feature for analysis. The best way to do that in a diverse dataset with multiple features was to isolate a subset of data using filter and select. For this dataset, I wanted to understand the recent household statistics for ALL Races , hence I chose the year for 2019. This way I can truly isolate one variable of income range and its percent distribution.
Bivariate Visualization(s)
<- ggplot(grouped_race, aes(x=Year, y=race_mean_income_estimate, color=CombinedRace)) +
race_income_line geom_line() +
facet_wrap(vars(CombinedRace))+
labs(title="Income change across year for each race")
ggplotly(race_income_line)
For Bivariate data visualization, I chose to analyse the change of income across all the other races. As a sequential analysis from the previous data visualization of income change for All races across year, I plan to move on to the analyse the trend for other races group. One of the most simplest and most efficient visual analysis is line table, It optimally conveys the information across time series in a very easy and digestable manner. The simplicity and efficiency of line chart for time series data model was the main motivation behind choosing this.
Since the analysis is on a time scale, I would consider Year as an independent variable. While the mean income estimate and the races forms the part of bivariate analysis. Race was grouped as part of sanity checks as well.
One exciting trend as evident from the graph is Asian’s in particular has a steeper rise in mean household income especially after the 2011.