DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 3 Solutions

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
    • Working with USA Households dataset
  • Read in data
  • Briefly describe the data
  • Cleaning the data
  • Pivot the Data

Challenge 3 Solutions

  • Show All Code
  • Hide All Code

  • View Source
challenge_3
animal_weights
eggs
australian_marriage
usa_households
sce_labor
Author

Vishnupriya Varadharaju

Published

November 10, 2022

Code
library(tidyverse)
library("readxl")
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Working with USA Households dataset

Read in data

Firstly, on opening the excel file, it is seen that the format of the data in the file is not straightforward. The first three rows have information regarding the data in this file. This data is provides the household incomes of various races throughout the years from 1967-2019. The percentage distribution (out of 100) for different ranges in income, along with the median and mean income was also in the given dataset. There are 12 categories of the races. There are some overlaps between these categories as well (“All Races” category) and all the years are not present for some of the races.

Code
# Reading in the USA Households dataset
# Removing the total column as information is redundant
h_income <- read_excel("_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx",
         skip=5, n_max = 351, col_names=c("year", "hnumber", "total","level1", "level2",
                                          "level3","level4","level5","level6","level7","level8",
                                          "level9","median_income","median_error",
                                    "mean_income","mean_error") ) %>% select(-total)

income_vals <- c("level1","level2","level3","level4","level5","level6","level7","level8","level9")
income_levels <- c("Under $15000","$15000 to $29000","$25000 to $34999","$35000 to $49999","$50000 to $74999","$75000 to $99999","$100000 to $149999","1500000 to $199999","$200000 and over")
h_income
# A tibble: 351 × 15
   year   hnumber level1 level2 level3 level4 level5 level6 level7 level8 level9
   <chr>  <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 ALL R… <NA>      NA     NA     NA     NA     NA     NA     NA     NA     NA  
 2 2019   128451     9.1    8      8.3   11.7   16.5   12.3   15.5    8.3   10.3
 3 2018   128579    10.1    8.8    8.7   12     17     12.5   15      7.2    8.8
 4 2017 2 127669    10      9.1    9.2   12     16.4   12.4   14.7    7.3    8.9
 5 2017   127586    10.1    9.1    9.2   11.9   16.3   12.6   14.8    7.5    8.5
 6 2016   126224    10.4    9      9.2   12.3   16.7   12.2   15      7.2    8  
 7 2015   125819    10.6   10      9.6   12.1   16.1   12.4   14.9    7.1    7.2
 8 2014   124587    11.4   10.5    9.6   12.6   16.4   12.1   14      6.6    6.8
 9 2013 3 123931    11.4   10.3    9.5   12.5   16.8   12     13.9    6.7    6.9
10 2013 4 122952    11.3   10.4    9.7   13.1   17     12.5   13.6    6.3    6  
# … with 341 more rows, and 4 more variables: median_income <dbl>,
#   median_error <dbl>, mean_income <chr>, mean_error <chr>

Briefly describe the data

In the above data, we have the necessary household income values for various throughout different years. The levels of the income range have also been named accordingly and the mapping is found in the vector ‘income_levels’. There is a need to pivot this data in order to get the the races corresponding to each year. Currently, the races are only present atop each section of data. Pivoting will help in grouping the data and making calculations easier against different categories of race.

Cleaning the data

The current data has the races in Capitals above each group of data. It is best to add the category of race as a new column for all the rows. Here, we get all the categories of race. However there are some numeric characters also present in them which can be removed.

Code
library(stringr)

# Creating a new column named race_cat + removing the rows with string only 
# race category
h_income_race <- h_income %>% mutate(race_cat = case_when(
  str_detect(year, "[A-Za-z]") ~ year,
  TRUE ~ NA_character_
)) %>% fill(race_cat) %>% filter(!str_detect(year, "[A-Za-z]"))

# Removing the notes number from the year and race_cat columns

h_income_race <- h_income_race %>% separate(year, c("year","notes"), sep = " ") %>% select(-notes)
h_income_race$race_cat <- gsub('[0-9]+', '', h_income_race$race_cat)

# Detected some non numeric characters in the numeric fields. So need to remove them
h_income_race <- h_income_race %>%
    mutate(across(c(hnumber, starts_with("level"), starts_with("me")),~ replace(.,str_detect(., "[A-Za-z]"), NA))) %>% mutate_at(vars(hnumber, starts_with("me"), starts_with("level")), as.numeric)

class(h_income_race$hnumber)
[1] "numeric"

Pivot the Data

The data needs to be pivoted such that for each race category the household income numbers are viewed under each year (as columns) this can help summarise the data for each race. However with the current data we can see that there are many overlapping instances of the categories of races. It will be easier to group them into a larger group for further analysis and pivoting. The data is then summarised across the numeric columns by calculating their sum.

The data is then pivoted into a wider form, wherein the rows are transformed into columns. This new table only consists of the household income numbers for the different categories of race across various years from 2000-2019. This data can further be used to find the mean or median for a particular year and particular race category.

Code
# Further clean the data
clean_h_income <- h_income_race %>% mutate(
  gp_race_cat = case_when(
    grepl("BLACK", race_cat, fixed=TRUE) ~ "grp_black",
    grepl("ASIAN", race_cat, fixed=TRUE) ~ "grp_asian",
    grepl("WHITE", race_cat, fixed=TRUE) ~ "grp_white",
    grepl("HISPANIC", race_cat, fixed=TRUE) & !grepl("NOT", race_cat, fixed=TRUE) ~ "grp_hisp",
    grepl("ALL", race_cat, fixed=TRUE) ~ "grp_all",
  )
) %>% filter(!is.na(gp_race_cat)) %>%
  group_by(year, gp_race_cat) %>% 
  summarise(across(c(starts_with("level"),starts_with("me"),
                     "hnumber"), 
                   ~sum(.x, na.rm=TRUE)))

head(clean_h_income)
# A tibble: 6 × 16
# Groups:   year [2]
  year  gp_race…¹ level1 level2 level3 level4 level5 level6 level7 level8 level9
  <chr> <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 1967  grp_all     14.8   10.2   10.9   16.8   24.8   11.9    7.7    1.7    1.2
2 1967  grp_black   26.8   17.7   15.2   16.4   14.8    5.5    2.7    0.6    0.3
3 1967  grp_white   13.5    9.4   10.4   16.9   25.8   12.6    8.2    1.8    1.3
4 1968  grp_all     13.4   10.1   10.4   16.5   24.8   13.7    8.2    1.8    1.1
5 1968  grp_black   24.4   17     15.5   16.2   16.6    6.5    3.2    0.4    0.1
6 1968  grp_white   12.3    9.3    9.8   16.5   25.7   14.5    8.8    1.9    1.2
# … with 5 more variables: median_income <dbl>, median_error <dbl>,
#   mean_income <dbl>, mean_error <dbl>, hnumber <dbl>, and abbreviated
#   variable name ¹​gp_race_cat
Code
# Pivot the data only containing household income numbers
clean_h_income %>% select(gp_race_cat, hnumber, year) %>%
  pivot_wider(values_from=hnumber, names_from=year) %>% select(c(gp_race_cat, starts_with("20")))
# A tibble: 5 × 21
  gp_rac…¹ `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009`
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 grp_all  108209 109297 111278 112000 113343 114384 116011 116783 117181 117538
2 grp_bla…  13174  13315  27243  27598  27960  28401  29063  29527  29651  29942
3 grp_whi… 170557 171500 172811 173110 174508 175591 177380 177877 178181 178647
4 grp_hisp  10034  10499  11339  11693  12178  12519  12973  13339  13425  13298
5 grp_asi…   3963   4071   7996   8275   8469   8773   9118   9209   9378   9627
# … with 10 more variables: `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
#   `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>,
#   `2018` <dbl>, `2019` <dbl>, and abbreviated variable name ¹​gp_race_cat

The cleaned data can also be pivoted by length, wherein there are more rows and lesser number of columns. Here, we can expand the income distribution range for different races through the years.

Code
# Pivoting the data
pivot_data <- clean_h_income %>% ungroup() %>%
  select(c(year, gp_race_cat, starts_with("level"),hnumber,)) %>%
  pivot_longer(cols=starts_with("level"), names_to="IncomeRange", values_to="percent")

# Replacing the income range levels
pivot_data$IncomeRange <- str_replace_all(pivot_data$IncomeRange, setNames(income_levels,income_vals))

# Calculating the number of household incomes for each income range distribution
pivot_data_longer <- pivot_data %>% mutate(
  range_household_number = round((hnumber*percent)/100)
)

pivot_data_longer
# A tibble: 2,151 × 6
   year  gp_race_cat hnumber IncomeRange        percent range_household_number
   <chr> <chr>         <dbl> <chr>                <dbl>                  <dbl>
 1 1967  grp_all       60813 Under $15000          14.8                   9000
 2 1967  grp_all       60813 $15000 to $29000      10.2                   6203
 3 1967  grp_all       60813 $25000 to $34999      10.9                   6629
 4 1967  grp_all       60813 $35000 to $49999      16.8                  10217
 5 1967  grp_all       60813 $50000 to $74999      24.8                  15082
 6 1967  grp_all       60813 $75000 to $99999      11.9                   7237
 7 1967  grp_all       60813 $100000 to $149999     7.7                   4683
 8 1967  grp_all       60813 1500000 to $199999     1.7                   1034
 9 1967  grp_all       60813 $200000 and over       1.2                    730
10 1967  grp_black      5728 Under $15000          26.8                   1535
# … with 2,141 more rows

From this table we can see that the data has been pivoted and increased in length, wherein for each income range, it’s percentage distribution and corresponding household income level for the different races across different years is present.

Code
pivot_data_longer %>% 
  group_by(IncomeRange, year) %>%
  summarise(across(c(range_household_number),~sum(.x, na.rm=TRUE))) %>% 
  arrange(desc(range_household_number))
# A tibble: 477 × 3
# Groups:   IncomeRange [9]
   IncomeRange        year  range_household_number
   <chr>              <chr>                  <dbl>
 1 $50000 to $74999   2013                  401751
 2 $50000 to $74999   2017                  398189
 3 $100000 to $149999 2017                  364573
 4 $100000 to $149999 2013                  329403
 5 $75000 to $99999   2017                  305661
 6 $35000 to $49999   2013                  302327
 7 $75000 to $99999   2013                  292607
 8 $35000 to $49999   2017                  290446
 9 Under $15000       2013                  266115
10 $15000 to $29000   2013                  243972
# … with 467 more rows

This data can be further grouped by income range and year and then arranged in descending order by the range_household_number. This can tell us the income range and the year for which the household income numbers were the highest or the lowest.

Credit - I looked up the solution shared by the professor for assistance on a couple of the portions while doing this challenge as I faced some difficulties while attempting it.

Source Code
---
title: "Challenge 3 Solutions"
author: "Vishnupriya Varadharaju"
desription: "Tidy Data: Pivoting"
date: "11/10/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  - animal_weights
  - eggs
  - australian_marriage
  - usa_households
  - sce_labor
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library("readxl")
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

### Working with USA Households dataset

## Read in data

Firstly, on opening the excel file, it is seen that the format of the data in the file is not straightforward. The first three rows have information regarding the data in this file. This data is provides the household incomes of various races throughout the years from 1967-2019. The percentage distribution (out of 100) for different ranges in income, along with the median and mean income was also in the given dataset. There are 12 categories of the races. There are some overlaps between these categories as well ("All Races" category) and all the years are not present for some of the races.

```{r}
# Reading in the USA Households dataset
# Removing the total column as information is redundant
h_income <- read_excel("_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx",
         skip=5, n_max = 351, col_names=c("year", "hnumber", "total","level1", "level2",
                                          "level3","level4","level5","level6","level7","level8",
                                          "level9","median_income","median_error",
                                    "mean_income","mean_error") ) %>% select(-total)

income_vals <- c("level1","level2","level3","level4","level5","level6","level7","level8","level9")
income_levels <- c("Under $15000","$15000 to $29000","$25000 to $34999","$35000 to $49999","$50000 to $74999","$75000 to $99999","$100000 to $149999","1500000 to $199999","$200000 and over")
h_income

```

## Briefly describe the data

In the above data, we have the necessary household income values for various throughout different years.
The levels of the income range have also been named accordingly and the mapping is found in the vector 'income_levels'. There is a need to pivot this data in order to get the the races corresponding to each year. Currently, the races are only present atop each section of data. Pivoting will help in grouping the data and making calculations easier against different categories of race.

## Cleaning the data

The current data has the races in Capitals above each group of data. It is best to add the category 
of race as a new column for all the rows. Here, we get all the categories of race. However there are some numeric characters also present in them which can be removed.

```{r}
library(stringr)

# Creating a new column named race_cat + removing the rows with string only 
# race category
h_income_race <- h_income %>% mutate(race_cat = case_when(
  str_detect(year, "[A-Za-z]") ~ year,
  TRUE ~ NA_character_
)) %>% fill(race_cat) %>% filter(!str_detect(year, "[A-Za-z]"))

# Removing the notes number from the year and race_cat columns

h_income_race <- h_income_race %>% separate(year, c("year","notes"), sep = " ") %>% select(-notes)
h_income_race$race_cat <- gsub('[0-9]+', '', h_income_race$race_cat)

# Detected some non numeric characters in the numeric fields. So need to remove them
h_income_race <- h_income_race %>%
    mutate(across(c(hnumber, starts_with("level"), starts_with("me")),~ replace(.,str_detect(., "[A-Za-z]"), NA))) %>% mutate_at(vars(hnumber, starts_with("me"), starts_with("level")), as.numeric)

class(h_income_race$hnumber)
```


## Pivot the Data

The data needs to be pivoted such that for each race category the household income numbers
are viewed under each year (as columns) this can help summarise the data for each race. 
However with the current data we can see that there are many overlapping instances of the
categories of races. It will be easier to group them into a larger group for further analysis and pivoting. The data is then summarised across the numeric columns by calculating their sum. 

The data is then pivoted into a wider form, wherein the rows are transformed into columns. This new table only consists of the household income numbers for the different categories of race across various years from 2000-2019. This data can further be used to find the mean or median for a particular year and particular race category. 

```{r}
# Further clean the data
clean_h_income <- h_income_race %>% mutate(
  gp_race_cat = case_when(
    grepl("BLACK", race_cat, fixed=TRUE) ~ "grp_black",
    grepl("ASIAN", race_cat, fixed=TRUE) ~ "grp_asian",
    grepl("WHITE", race_cat, fixed=TRUE) ~ "grp_white",
    grepl("HISPANIC", race_cat, fixed=TRUE) & !grepl("NOT", race_cat, fixed=TRUE) ~ "grp_hisp",
    grepl("ALL", race_cat, fixed=TRUE) ~ "grp_all",
  )
) %>% filter(!is.na(gp_race_cat)) %>%
  group_by(year, gp_race_cat) %>% 
  summarise(across(c(starts_with("level"),starts_with("me"),
                     "hnumber"), 
                   ~sum(.x, na.rm=TRUE)))

head(clean_h_income)

# Pivot the data only containing household income numbers
clean_h_income %>% select(gp_race_cat, hnumber, year) %>%
  pivot_wider(values_from=hnumber, names_from=year) %>% select(c(gp_race_cat, starts_with("20")))
```

The cleaned data can also be pivoted by length, wherein there are more rows and lesser number of columns. Here, we can expand the income distribution range for different races through the years.

```{r}
# Pivoting the data
pivot_data <- clean_h_income %>% ungroup() %>%
  select(c(year, gp_race_cat, starts_with("level"),hnumber,)) %>%
  pivot_longer(cols=starts_with("level"), names_to="IncomeRange", values_to="percent")

# Replacing the income range levels
pivot_data$IncomeRange <- str_replace_all(pivot_data$IncomeRange, setNames(income_levels,income_vals))

# Calculating the number of household incomes for each income range distribution
pivot_data_longer <- pivot_data %>% mutate(
  range_household_number = round((hnumber*percent)/100)
)

pivot_data_longer
```

From this table we can see that the data has been pivoted and increased in length, wherein for each income range, it's percentage distribution and corresponding household income level for the different races across different years is present. 


```{r}
pivot_data_longer %>% 
  group_by(IncomeRange, year) %>%
  summarise(across(c(range_household_number),~sum(.x, na.rm=TRUE))) %>% 
  arrange(desc(range_household_number))
  
```

This data can be further grouped by income range and year and then arranged in descending order by the range_household_number. This can tell us the income range and the year for which the household income numbers were the highest or the lowest.

Credit - I looked up the solution shared by the professor for assistance on a couple of the portions while doing this challenge as I faced some difficulties while attempting it.