Code
library(tidyverse)
library("readxl")
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Vishnupriya Varadharaju
November 10, 2022
Firstly, on opening the excel file, it is seen that the format of the data in the file is not straightforward. The first three rows have information regarding the data in this file. This data is provides the household incomes of various races throughout the years from 1967-2019. The percentage distribution (out of 100) for different ranges in income, along with the median and mean income was also in the given dataset. There are 12 categories of the races. There are some overlaps between these categories as well (“All Races” category) and all the years are not present for some of the races.
# Reading in the USA Households dataset
# Removing the total column as information is redundant
h_income <- read_excel("_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx",
skip=5, n_max = 351, col_names=c("year", "hnumber", "total","level1", "level2",
"level3","level4","level5","level6","level7","level8",
"level9","median_income","median_error",
"mean_income","mean_error") ) %>% select(-total)
income_vals <- c("level1","level2","level3","level4","level5","level6","level7","level8","level9")
income_levels <- c("Under $15000","$15000 to $29000","$25000 to $34999","$35000 to $49999","$50000 to $74999","$75000 to $99999","$100000 to $149999","1500000 to $199999","$200000 and over")
h_income
# A tibble: 351 × 15
year hnumber level1 level2 level3 level4 level5 level6 level7 level8 level9
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ALL R… <NA> NA NA NA NA NA NA NA NA NA
2 2019 128451 9.1 8 8.3 11.7 16.5 12.3 15.5 8.3 10.3
3 2018 128579 10.1 8.8 8.7 12 17 12.5 15 7.2 8.8
4 2017 2 127669 10 9.1 9.2 12 16.4 12.4 14.7 7.3 8.9
5 2017 127586 10.1 9.1 9.2 11.9 16.3 12.6 14.8 7.5 8.5
6 2016 126224 10.4 9 9.2 12.3 16.7 12.2 15 7.2 8
7 2015 125819 10.6 10 9.6 12.1 16.1 12.4 14.9 7.1 7.2
8 2014 124587 11.4 10.5 9.6 12.6 16.4 12.1 14 6.6 6.8
9 2013 3 123931 11.4 10.3 9.5 12.5 16.8 12 13.9 6.7 6.9
10 2013 4 122952 11.3 10.4 9.7 13.1 17 12.5 13.6 6.3 6
# … with 341 more rows, and 4 more variables: median_income <dbl>,
# median_error <dbl>, mean_income <chr>, mean_error <chr>
In the above data, we have the necessary household income values for various throughout different years. The levels of the income range have also been named accordingly and the mapping is found in the vector ‘income_levels’. There is a need to pivot this data in order to get the the races corresponding to each year. Currently, the races are only present atop each section of data. Pivoting will help in grouping the data and making calculations easier against different categories of race.
The current data has the races in Capitals above each group of data. It is best to add the category of race as a new column for all the rows. Here, we get all the categories of race. However there are some numeric characters also present in them which can be removed.
library(stringr)
# Creating a new column named race_cat + removing the rows with string only
# race category
h_income_race <- h_income %>% mutate(race_cat = case_when(
str_detect(year, "[A-Za-z]") ~ year,
TRUE ~ NA_character_
)) %>% fill(race_cat) %>% filter(!str_detect(year, "[A-Za-z]"))
# Removing the notes number from the year and race_cat columns
h_income_race <- h_income_race %>% separate(year, c("year","notes"), sep = " ") %>% select(-notes)
h_income_race$race_cat <- gsub('[0-9]+', '', h_income_race$race_cat)
# Detected some non numeric characters in the numeric fields. So need to remove them
h_income_race <- h_income_race %>%
mutate(across(c(hnumber, starts_with("level"), starts_with("me")),~ replace(.,str_detect(., "[A-Za-z]"), NA))) %>% mutate_at(vars(hnumber, starts_with("me"), starts_with("level")), as.numeric)
class(h_income_race$hnumber)
[1] "numeric"
The data needs to be pivoted such that for each race category the household income numbers are viewed under each year (as columns) this can help summarise the data for each race. However with the current data we can see that there are many overlapping instances of the categories of races. It will be easier to group them into a larger group for further analysis and pivoting. The data is then summarised across the numeric columns by calculating their sum.
The data is then pivoted into a wider form, wherein the rows are transformed into columns. This new table only consists of the household income numbers for the different categories of race across various years from 2000-2019. This data can further be used to find the mean or median for a particular year and particular race category.
# Further clean the data
clean_h_income <- h_income_race %>% mutate(
gp_race_cat = case_when(
grepl("BLACK", race_cat, fixed=TRUE) ~ "grp_black",
grepl("ASIAN", race_cat, fixed=TRUE) ~ "grp_asian",
grepl("WHITE", race_cat, fixed=TRUE) ~ "grp_white",
grepl("HISPANIC", race_cat, fixed=TRUE) & !grepl("NOT", race_cat, fixed=TRUE) ~ "grp_hisp",
grepl("ALL", race_cat, fixed=TRUE) ~ "grp_all",
)
) %>% filter(!is.na(gp_race_cat)) %>%
group_by(year, gp_race_cat) %>%
summarise(across(c(starts_with("level"),starts_with("me"),
"hnumber"),
~sum(.x, na.rm=TRUE)))
head(clean_h_income)
# A tibble: 6 × 16
# Groups: year [2]
year gp_race…¹ level1 level2 level3 level4 level5 level6 level7 level8 level9
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1967 grp_all 14.8 10.2 10.9 16.8 24.8 11.9 7.7 1.7 1.2
2 1967 grp_black 26.8 17.7 15.2 16.4 14.8 5.5 2.7 0.6 0.3
3 1967 grp_white 13.5 9.4 10.4 16.9 25.8 12.6 8.2 1.8 1.3
4 1968 grp_all 13.4 10.1 10.4 16.5 24.8 13.7 8.2 1.8 1.1
5 1968 grp_black 24.4 17 15.5 16.2 16.6 6.5 3.2 0.4 0.1
6 1968 grp_white 12.3 9.3 9.8 16.5 25.7 14.5 8.8 1.9 1.2
# … with 5 more variables: median_income <dbl>, median_error <dbl>,
# mean_income <dbl>, mean_error <dbl>, hnumber <dbl>, and abbreviated
# variable name ¹gp_race_cat
# A tibble: 5 × 21
gp_rac…¹ `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 grp_all 108209 109297 111278 112000 113343 114384 116011 116783 117181 117538
2 grp_bla… 13174 13315 27243 27598 27960 28401 29063 29527 29651 29942
3 grp_whi… 170557 171500 172811 173110 174508 175591 177380 177877 178181 178647
4 grp_hisp 10034 10499 11339 11693 12178 12519 12973 13339 13425 13298
5 grp_asi… 3963 4071 7996 8275 8469 8773 9118 9209 9378 9627
# … with 10 more variables: `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
# `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>,
# `2018` <dbl>, `2019` <dbl>, and abbreviated variable name ¹gp_race_cat
The cleaned data can also be pivoted by length, wherein there are more rows and lesser number of columns. Here, we can expand the income distribution range for different races through the years.
# Pivoting the data
pivot_data <- clean_h_income %>% ungroup() %>%
select(c(year, gp_race_cat, starts_with("level"),hnumber,)) %>%
pivot_longer(cols=starts_with("level"), names_to="IncomeRange", values_to="percent")
# Replacing the income range levels
pivot_data$IncomeRange <- str_replace_all(pivot_data$IncomeRange, setNames(income_levels,income_vals))
# Calculating the number of household incomes for each income range distribution
pivot_data_longer <- pivot_data %>% mutate(
range_household_number = round((hnumber*percent)/100)
)
pivot_data_longer
# A tibble: 2,151 × 6
year gp_race_cat hnumber IncomeRange percent range_household_number
<chr> <chr> <dbl> <chr> <dbl> <dbl>
1 1967 grp_all 60813 Under $15000 14.8 9000
2 1967 grp_all 60813 $15000 to $29000 10.2 6203
3 1967 grp_all 60813 $25000 to $34999 10.9 6629
4 1967 grp_all 60813 $35000 to $49999 16.8 10217
5 1967 grp_all 60813 $50000 to $74999 24.8 15082
6 1967 grp_all 60813 $75000 to $99999 11.9 7237
7 1967 grp_all 60813 $100000 to $149999 7.7 4683
8 1967 grp_all 60813 1500000 to $199999 1.7 1034
9 1967 grp_all 60813 $200000 and over 1.2 730
10 1967 grp_black 5728 Under $15000 26.8 1535
# … with 2,141 more rows
From this table we can see that the data has been pivoted and increased in length, wherein for each income range, it’s percentage distribution and corresponding household income level for the different races across different years is present.
# A tibble: 477 × 3
# Groups: IncomeRange [9]
IncomeRange year range_household_number
<chr> <chr> <dbl>
1 $50000 to $74999 2013 401751
2 $50000 to $74999 2017 398189
3 $100000 to $149999 2017 364573
4 $100000 to $149999 2013 329403
5 $75000 to $99999 2017 305661
6 $35000 to $49999 2013 302327
7 $75000 to $99999 2013 292607
8 $35000 to $49999 2017 290446
9 Under $15000 2013 266115
10 $15000 to $29000 2013 243972
# … with 467 more rows
This data can be further grouped by income range and year and then arranged in descending order by the range_household_number. This can tell us the income range and the year for which the household income numbers were the highest or the lowest.
Credit - I looked up the solution shared by the professor for assistance on a couple of the portions while doing this challenge as I faced some difficulties while attempting it.
---
title: "Challenge 3 Solutions"
author: "Vishnupriya Varadharaju"
desription: "Tidy Data: Pivoting"
date: "11/10/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- animal_weights
- eggs
- australian_marriage
- usa_households
- sce_labor
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library("readxl")
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
### Working with USA Households dataset
## Read in data
Firstly, on opening the excel file, it is seen that the format of the data in the file is not straightforward. The first three rows have information regarding the data in this file. This data is provides the household incomes of various races throughout the years from 1967-2019. The percentage distribution (out of 100) for different ranges in income, along with the median and mean income was also in the given dataset. There are 12 categories of the races. There are some overlaps between these categories as well ("All Races" category) and all the years are not present for some of the races.
```{r}
# Reading in the USA Households dataset
# Removing the total column as information is redundant
h_income <- read_excel("_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx",
skip=5, n_max = 351, col_names=c("year", "hnumber", "total","level1", "level2",
"level3","level4","level5","level6","level7","level8",
"level9","median_income","median_error",
"mean_income","mean_error") ) %>% select(-total)
income_vals <- c("level1","level2","level3","level4","level5","level6","level7","level8","level9")
income_levels <- c("Under $15000","$15000 to $29000","$25000 to $34999","$35000 to $49999","$50000 to $74999","$75000 to $99999","$100000 to $149999","1500000 to $199999","$200000 and over")
h_income
```
## Briefly describe the data
In the above data, we have the necessary household income values for various throughout different years.
The levels of the income range have also been named accordingly and the mapping is found in the vector 'income_levels'. There is a need to pivot this data in order to get the the races corresponding to each year. Currently, the races are only present atop each section of data. Pivoting will help in grouping the data and making calculations easier against different categories of race.
## Cleaning the data
The current data has the races in Capitals above each group of data. It is best to add the category
of race as a new column for all the rows. Here, we get all the categories of race. However there are some numeric characters also present in them which can be removed.
```{r}
library(stringr)
# Creating a new column named race_cat + removing the rows with string only
# race category
h_income_race <- h_income %>% mutate(race_cat = case_when(
str_detect(year, "[A-Za-z]") ~ year,
TRUE ~ NA_character_
)) %>% fill(race_cat) %>% filter(!str_detect(year, "[A-Za-z]"))
# Removing the notes number from the year and race_cat columns
h_income_race <- h_income_race %>% separate(year, c("year","notes"), sep = " ") %>% select(-notes)
h_income_race$race_cat <- gsub('[0-9]+', '', h_income_race$race_cat)
# Detected some non numeric characters in the numeric fields. So need to remove them
h_income_race <- h_income_race %>%
mutate(across(c(hnumber, starts_with("level"), starts_with("me")),~ replace(.,str_detect(., "[A-Za-z]"), NA))) %>% mutate_at(vars(hnumber, starts_with("me"), starts_with("level")), as.numeric)
class(h_income_race$hnumber)
```
## Pivot the Data
The data needs to be pivoted such that for each race category the household income numbers
are viewed under each year (as columns) this can help summarise the data for each race.
However with the current data we can see that there are many overlapping instances of the
categories of races. It will be easier to group them into a larger group for further analysis and pivoting. The data is then summarised across the numeric columns by calculating their sum.
The data is then pivoted into a wider form, wherein the rows are transformed into columns. This new table only consists of the household income numbers for the different categories of race across various years from 2000-2019. This data can further be used to find the mean or median for a particular year and particular race category.
```{r}
# Further clean the data
clean_h_income <- h_income_race %>% mutate(
gp_race_cat = case_when(
grepl("BLACK", race_cat, fixed=TRUE) ~ "grp_black",
grepl("ASIAN", race_cat, fixed=TRUE) ~ "grp_asian",
grepl("WHITE", race_cat, fixed=TRUE) ~ "grp_white",
grepl("HISPANIC", race_cat, fixed=TRUE) & !grepl("NOT", race_cat, fixed=TRUE) ~ "grp_hisp",
grepl("ALL", race_cat, fixed=TRUE) ~ "grp_all",
)
) %>% filter(!is.na(gp_race_cat)) %>%
group_by(year, gp_race_cat) %>%
summarise(across(c(starts_with("level"),starts_with("me"),
"hnumber"),
~sum(.x, na.rm=TRUE)))
head(clean_h_income)
# Pivot the data only containing household income numbers
clean_h_income %>% select(gp_race_cat, hnumber, year) %>%
pivot_wider(values_from=hnumber, names_from=year) %>% select(c(gp_race_cat, starts_with("20")))
```
The cleaned data can also be pivoted by length, wherein there are more rows and lesser number of columns. Here, we can expand the income distribution range for different races through the years.
```{r}
# Pivoting the data
pivot_data <- clean_h_income %>% ungroup() %>%
select(c(year, gp_race_cat, starts_with("level"),hnumber,)) %>%
pivot_longer(cols=starts_with("level"), names_to="IncomeRange", values_to="percent")
# Replacing the income range levels
pivot_data$IncomeRange <- str_replace_all(pivot_data$IncomeRange, setNames(income_levels,income_vals))
# Calculating the number of household incomes for each income range distribution
pivot_data_longer <- pivot_data %>% mutate(
range_household_number = round((hnumber*percent)/100)
)
pivot_data_longer
```
From this table we can see that the data has been pivoted and increased in length, wherein for each income range, it's percentage distribution and corresponding household income level for the different races across different years is present.
```{r}
pivot_data_longer %>%
group_by(IncomeRange, year) %>%
summarise(across(c(range_household_number),~sum(.x, na.rm=TRUE))) %>%
arrange(desc(range_household_number))
```
This data can be further grouped by income range and year and then arranged in descending order by the range_household_number. This can tell us the income range and the year for which the household income numbers were the highest or the lowest.
Credit - I looked up the solution shared by the professor for assistance on a couple of the portions while doing this challenge as I faced some difficulties while attempting it.