DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 3

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Data Overview
    • Data Description
  • Anticipated End Result
  • Pivot the Data
    • Credit

Challenge 3

  • Show All Code
  • Hide All Code

  • View Source
challenge_3
animal_weights
eggs
australian_marriage
usa_households
sce_labor
Author

Matthew O’Neill

Published

October 7, 2022

Code
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Data Overview

I’ve chosen the US households data

Code
#Code below was taken from Meredith's as I wasn't entirely sure what the most useful
#dataset structure would be
income_brackets <- c(i1 = "Under $15,000",
                     i2 = "$15,000 to $24,999",
                     i3 = "$25,000 to $34,999",
                     i4= "$35,000 to $49,999",
                     i5 = "$50,000 to $74,999",
                     i6 = "$75,000 to $99,999",
                     i7 = "$100,000 to $149,999",
                     i8 = "$150,000 to $199,999",
                     i9 = "$200,000 and over")

data <- read_excel("../posts/_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx", sheet=NULL, range=NULL,
         skip=5,
         n_max = 352,
         col_names = c("year", "hholds", "del",
                       str_c("income",1:9,sep="_i"),
                       "median_inc", "median_se", 
                       "mean_inc","mean_se"))%>%
  select(-del)
data
# A tibble: 352 × 15
   year   hholds incom…¹ incom…² incom…³ incom…⁴ incom…⁵ incom…⁶ incom…⁷ incom…⁸
   <chr>  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ALL R… <NA>      NA      NA      NA      NA      NA      NA      NA      NA  
 2 2019   128451     9.1     8       8.3    11.7    16.5    12.3    15.5     8.3
 3 2018   128579    10.1     8.8     8.7    12      17      12.5    15       7.2
 4 2017 2 127669    10       9.1     9.2    12      16.4    12.4    14.7     7.3
 5 2017   127586    10.1     9.1     9.2    11.9    16.3    12.6    14.8     7.5
 6 2016   126224    10.4     9       9.2    12.3    16.7    12.2    15       7.2
 7 2015   125819    10.6    10       9.6    12.1    16.1    12.4    14.9     7.1
 8 2014   124587    11.4    10.5     9.6    12.6    16.4    12.1    14       6.6
 9 2013 3 123931    11.4    10.3     9.5    12.5    16.8    12      13.9     6.7
10 2013 4 122952    11.3    10.4     9.7    13.1    17      12.5    13.6     6.3
# … with 342 more rows, 5 more variables: income_i9 <dbl>, median_inc <dbl>,
#   median_se <dbl>, mean_inc <chr>, mean_se <chr>, and abbreviated variable
#   names ¹​income_i1, ²​income_i2, ³​income_i3, ⁴​income_i4, ⁵​income_i5,
#   ⁶​income_i6, ⁷​income_i7, ⁸​income_i8

Data Description

The dataset I decided to work with this week was the USA household data, which breaks down US households by income, race, and hispanic origin from 1967 to 2019. It contains the number of households surveyed in a given year, what income bracket each was in, andthe mean and median for each year and group.

This dataeet could be very useful in an analysis about wealth inequality by race and to look at the kinds of trends there might be over the 50 or so year period that data was collected.

We could use pivot tables here since we are interested in aggregates of different groups, and we could specifically build a table to see what proportion of each income bracket was made up of which races and ethnicities.

One thing we will want to do before we start however, is clean up our data a bit, because instead of race being a column in the dataset, it is currently just a divider between sections.

Code
#I once again used code from the solution as I couldn't figure out how to organize the data
#properly. I feel I have a much better idea of how to clean data after this exercise.

#I planned to only use the first part to remove empty lines between sections but the second part is very useful for the next section where we group alone and not alone.
data<-data%>%
  mutate(identity = case_when(
    str_detect(year, "[[:alpha:]]") ~ year,
    TRUE ~ NA_character_
  ))%>%
  fill(identity)%>%
  filter(!str_detect(year, "[[:alpha:]]"))

data<-data%>%
  separate(year, into=c("year", "delete"), sep=" ")%>%
  mutate(identity = str_remove(identity, " [0-9]+"),
         across(any_of(c("hholds", "mean_inc", "mean_se", "year")), 
                parse_number))%>%
  select(-delete)

data
# A tibble: 340 × 16
    year hholds income…¹ incom…² incom…³ incom…⁴ incom…⁵ incom…⁶ incom…⁷ incom…⁸
   <dbl>  <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1  2019 128451      9.1     8       8.3    11.7    16.5    12.3    15.5     8.3
 2  2018 128579     10.1     8.8     8.7    12      17      12.5    15       7.2
 3  2017 127669     10       9.1     9.2    12      16.4    12.4    14.7     7.3
 4  2017 127586     10.1     9.1     9.2    11.9    16.3    12.6    14.8     7.5
 5  2016 126224     10.4     9       9.2    12.3    16.7    12.2    15       7.2
 6  2015 125819     10.6    10       9.6    12.1    16.1    12.4    14.9     7.1
 7  2014 124587     11.4    10.5     9.6    12.6    16.4    12.1    14       6.6
 8  2013 123931     11.4    10.3     9.5    12.5    16.8    12      13.9     6.7
 9  2013 122952     11.3    10.4     9.7    13.1    17      12.5    13.6     6.3
10  2012 122459     11.4    10.6    10.1    12.5    17.4    12      13.9     6.3
# … with 330 more rows, 6 more variables: income_i9 <dbl>, median_inc <dbl>,
#   median_se <dbl>, mean_inc <dbl>, mean_se <dbl>, identity <chr>, and
#   abbreviated variable names ¹​income_i1, ²​income_i2, ³​income_i3, ⁴​income_i4,
#   ⁵​income_i5, ⁶​income_i6, ⁷​income_i7, ⁸​income_i8

Now the dataset is in much better shape, with an “identity” column which we will be able to use to make comparisons, and we no longer have to worry about empty rows signifying a change in race. We will want to make this a bit more condensed still, as the data contains distinctions between households with alone people and not alone, but we ware mainly just interested in a potential inequality between races, so we can safely group the alone and not alone rows into groups. For example we will label the group of all data for white households as “gp_white”.

Code
#I used the code from the solution here to group alone and not alone, as well as add a 
#"all races" section. The filter is also needed to combine values of alone and not alone.
data <-data%>%
  mutate(gp_identity = case_when(
   identity %in% c("BLACK", "BLACK ALONE") ~ "gp_black",
    identity %in% c("ASIAN ALONE OR IN COMBINATION",
                  "ASIAN AND PACIFIC ISLANDER") ~ "gp_asian",
    identity %in% c("WHITE, NOT HISPANIC", 
                    "WHITE ALONE, NOT HISPANIC") ~ "gp_white",
    identity %in% c("HISPANIC (ANY RACE)") ~ "gp_hisp",
    identity %in% c("ALL RACES") ~ "gp_all"
  ))%>%
  filter(!is.na(gp_identity))%>%
    group_by(year, gp_identity)%>%
    summarise(across(c(starts_with("inc"),starts_with("me"),
                      "hholds"), 
                      ~sum(.x, na.rm=TRUE)))%>%
  ungroup()
data
# A tibble: 235 × 16
    year gp_id…¹ incom…² incom…³ incom…⁴ incom…⁵ incom…⁶ incom…⁷ incom…⁸ incom…⁹
   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1  1967 gp_all     14.8    10.2    10.9    16.8    24.8    11.9     7.7     1.7
 2  1967 gp_bla…    26.8    17.7    15.2    16.4    14.8     5.5     2.7     0.6
 3  1968 gp_all     13.4    10.1    10.4    16.5    24.8    13.7     8.2     1.8
 4  1968 gp_bla…    24.4    17      15.5    16.2    16.6     6.5     3.2     0.4
 5  1969 gp_all     13.2     9.9     9.4    16      24.3    14.2     9.4     2.2
 6  1969 gp_bla…    24      16.1    14.1    17.7    16.9     6.7     3.8     0.5
 7  1970 gp_all     13.3    10.1     9.7    15.7    24.1    13.7     9.9     2.2
 8  1970 gp_bla…    24.1    15.8    13.7    17.1    16.7     7.4     4.4     0.6
 9  1971 gp_all     13.3    10.2    10.6    15.2    23.5    13.9     9.7     2.2
10  1971 gp_bla…    24.7    16.2    14.2    16.3    16.7     7.3     3.8     0.6
# … with 225 more rows, 6 more variables: income_i9 <dbl>, median_inc <dbl>,
#   median_se <dbl>, mean_inc <dbl>, mean_se <dbl>, hholds <dbl>, and
#   abbreviated variable names ¹​gp_identity, ²​income_i1, ³​income_i2,
#   ⁴​income_i3, ⁵​income_i4, ⁶​income_i5, ⁷​income_i6, ⁸​income_i7, ⁹​income_i8

Anticipated End Result

The end product should look like a table with the percentage of each income bracket that each race makes up. This could shed light on races being disproportionally represented in different income brackets.

Ideally we would have each year, race, and income bracket as seperate rows to see the income of households over time and what proportion of each income bracket they made up in a given year.

Code
data
# A tibble: 235 × 16
    year gp_id…¹ incom…² incom…³ incom…⁴ incom…⁵ incom…⁶ incom…⁷ incom…⁸ incom…⁹
   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1  1967 gp_all     14.8    10.2    10.9    16.8    24.8    11.9     7.7     1.7
 2  1967 gp_bla…    26.8    17.7    15.2    16.4    14.8     5.5     2.7     0.6
 3  1968 gp_all     13.4    10.1    10.4    16.5    24.8    13.7     8.2     1.8
 4  1968 gp_bla…    24.4    17      15.5    16.2    16.6     6.5     3.2     0.4
 5  1969 gp_all     13.2     9.9     9.4    16      24.3    14.2     9.4     2.2
 6  1969 gp_bla…    24      16.1    14.1    17.7    16.9     6.7     3.8     0.5
 7  1970 gp_all     13.3    10.1     9.7    15.7    24.1    13.7     9.9     2.2
 8  1970 gp_bla…    24.1    15.8    13.7    17.1    16.7     7.4     4.4     0.6
 9  1971 gp_all     13.3    10.2    10.6    15.2    23.5    13.9     9.7     2.2
10  1971 gp_bla…    24.7    16.2    14.2    16.3    16.7     7.3     3.8     0.6
# … with 225 more rows, 6 more variables: income_i9 <dbl>, median_inc <dbl>,
#   median_se <dbl>, mean_inc <dbl>, mean_se <dbl>, hholds <dbl>, and
#   abbreviated variable names ¹​gp_identity, ²​income_i1, ³​income_i2,
#   ⁴​income_i3, ⁵​income_i4, ⁶​income_i5, ⁷​income_i6, ⁸​income_i7, ⁹​income_i8
Code
nrow(data)
[1] 235
Code
ncol(data)
[1] 16

Now our current dataset contains 235 rows and 16 columns, but we want to pivot each income bracket column into one single variable, and since there are 9 income brackets, we should expect there to be 9*235, or 2115 rows in our new dataset.

We are not interested in the means and medians currently, so we won’t select those columns when we do out pivot.

The 9 income bracket columns will shrink to 1 column, and 1 column will be added to hold the percentage value, so we should expect 6 columns: year, race, number of households, income bracket, and percentile.

Thus, we expect our output to be 2115 rows and 6 columns,

Pivot the Data

Code
#FInally, I used solution code again here. Pivot_longer was a function that I was confused
#about, but the solution made thigns very clear, as we are reducing the income columns to see
#them as rows instead. 
#The second part gives more description to the income bracket types.
brackets <-data%>%
  ungroup()%>%
  select(year, gp_identity, hholds, starts_with("income"))%>%
  pivot_longer(cols=starts_with("income"),
               names_prefix= "income_",
               names_to = "income_bracket",
               values_to = "percent")%>%
  mutate(hholds_bracket = round(hholds*(percent/100)),
         income_bracket = recode(income_bracket,!!!income_brackets))
brackets
# A tibble: 2,115 × 6
    year gp_identity hholds income_bracket       percent hholds_bracket
   <dbl> <chr>        <dbl> <chr>                  <dbl>          <dbl>
 1  1967 gp_all       60813 Under $15,000           14.8           9000
 2  1967 gp_all       60813 $15,000 to $24,999      10.2           6203
 3  1967 gp_all       60813 $25,000 to $34,999      10.9           6629
 4  1967 gp_all       60813 $35,000 to $49,999      16.8          10217
 5  1967 gp_all       60813 $50,000 to $74,999      24.8          15082
 6  1967 gp_all       60813 $75,000 to $99,999      11.9           7237
 7  1967 gp_all       60813 $100,000 to $149,999     7.7           4683
 8  1967 gp_all       60813 $150,000 to $199,999     1.7           1034
 9  1967 gp_all       60813 $200,000 and over        1.2            730
10  1967 gp_black      5728 Under $15,000           26.8           1535
# … with 2,105 more rows
Code
nrow(brackets)
[1] 2115
Code
ncol(brackets)
[1] 6

We can see that our rows and columns are as expected, and we now have a nice table showing the percent of households of each race in every income bracket in any given year.

Credit

I found myself stuck on various data cleaning steps in this exercise, as well as figuring out how to use pivot tables to get the data in the way we wanted, so I followed the code from the solutions for this exercise.

Overall, I needed some help wit hexamples of how to apply R functions to clean data and change it into a more useful form, and I will hopefulyl have a much easier time doing so in future challenges and assignments.

Source Code
---
title: "Challenge 3"
author: "Matthew O'Neill"
desription: "Tidy Data: Pivoting"
date: "10/07/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  - animal_weights
  - eggs
  - australian_marriage
  - usa_households
  - sce_labor
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to:

1.  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2.  identify what needs to be done to tidy the current data
3.  anticipate the shape of pivoted data
4.  pivot the data into tidy format using `pivot_longer`

## Data Overview

I've chosen the US households data

```{r}
#Code below was taken from Meredith's as I wasn't entirely sure what the most useful
#dataset structure would be
income_brackets <- c(i1 = "Under $15,000",
                     i2 = "$15,000 to $24,999",
                     i3 = "$25,000 to $34,999",
                     i4= "$35,000 to $49,999",
                     i5 = "$50,000 to $74,999",
                     i6 = "$75,000 to $99,999",
                     i7 = "$100,000 to $149,999",
                     i8 = "$150,000 to $199,999",
                     i9 = "$200,000 and over")

data <- read_excel("../posts/_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx", sheet=NULL, range=NULL,
         skip=5,
         n_max = 352,
         col_names = c("year", "hholds", "del",
                       str_c("income",1:9,sep="_i"),
                       "median_inc", "median_se", 
                       "mean_inc","mean_se"))%>%
  select(-del)
data

```

### Data Description

The dataset I decided to work with this week was the USA household data, which breaks down US households by income, race, and hispanic origin from 1967 to 2019. It contains the number of households surveyed in a given year, what income bracket each was in, andthe mean and median for each year and group.

This dataeet could be very useful in an analysis about wealth inequality by race and to look at the kinds of trends there might be over the 50 or so year period that data was collected.

We could use pivot tables here since we are interested in aggregates of different groups, and we could specifically build a table to see what proportion of each income bracket was made up of which races and ethnicities.

One thing we will want to do before we start however, is clean up our data a bit, because instead of race being a column in the dataset, it is currently just a divider between sections. 

```{r}
#I once again used code from the solution as I couldn't figure out how to organize the data
#properly. I feel I have a much better idea of how to clean data after this exercise.

#I planned to only use the first part to remove empty lines between sections but the second part is very useful for the next section where we group alone and not alone.
data<-data%>%
  mutate(identity = case_when(
    str_detect(year, "[[:alpha:]]") ~ year,
    TRUE ~ NA_character_
  ))%>%
  fill(identity)%>%
  filter(!str_detect(year, "[[:alpha:]]"))

data<-data%>%
  separate(year, into=c("year", "delete"), sep=" ")%>%
  mutate(identity = str_remove(identity, " [0-9]+"),
         across(any_of(c("hholds", "mean_inc", "mean_se", "year")), 
                parse_number))%>%
  select(-delete)

data

```

Now the dataset is in much better shape, with an "identity" column which we will be able to use to make comparisons, and we no longer have to worry about empty rows signifying a change in race. We will want to make this a bit more condensed still, as the data contains distinctions between households with alone people and not alone, but we ware mainly just interested in a potential inequality between races, so we can safely group the alone and not alone rows into groups. For example we will label the group of all data for white households as "gp_white".


```{r}
#I used the code from the solution here to group alone and not alone, as well as add a 
#"all races" section. The filter is also needed to combine values of alone and not alone.
data <-data%>%
  mutate(gp_identity = case_when(
   identity %in% c("BLACK", "BLACK ALONE") ~ "gp_black",
    identity %in% c("ASIAN ALONE OR IN COMBINATION",
                  "ASIAN AND PACIFIC ISLANDER") ~ "gp_asian",
    identity %in% c("WHITE, NOT HISPANIC", 
                    "WHITE ALONE, NOT HISPANIC") ~ "gp_white",
    identity %in% c("HISPANIC (ANY RACE)") ~ "gp_hisp",
    identity %in% c("ALL RACES") ~ "gp_all"
  ))%>%
  filter(!is.na(gp_identity))%>%
    group_by(year, gp_identity)%>%
    summarise(across(c(starts_with("inc"),starts_with("me"),
                      "hholds"), 
                      ~sum(.x, na.rm=TRUE)))%>%
  ungroup()
data

```




## Anticipated End Result

The end product should look like a table with the percentage of each income bracket that each race makes up. This could shed light on races being disproportionally represented in different income brackets.

Ideally we would have each year, race, and income bracket as seperate rows to see the income of households over time and what proportion of each income bracket they made up in a given year.


```{r}
data
nrow(data)
ncol(data)

```

Now our current dataset contains 235 rows and 16 columns, but we want to pivot each income bracket column into one single variable, and since there are 9 income brackets, we should expect there to be 9*235, or 2115 rows in our new dataset. 

We are not interested in the means and medians currently, so we won't select those columns when we do out pivot.

The 9 income bracket columns will shrink to 1 column, and 1 column will be added to hold the percentage value, so we should expect 6 columns: year, race, number of households, income bracket, and percentile. 

Thus, we expect our output to be 2115 rows and 6 columns,

## Pivot the Data


```{r}
#FInally, I used solution code again here. Pivot_longer was a function that I was confused
#about, but the solution made thigns very clear, as we are reducing the income columns to see
#them as rows instead. 
#The second part gives more description to the income bracket types.
brackets <-data%>%
  ungroup()%>%
  select(year, gp_identity, hholds, starts_with("income"))%>%
  pivot_longer(cols=starts_with("income"),
               names_prefix= "income_",
               names_to = "income_bracket",
               values_to = "percent")%>%
  mutate(hholds_bracket = round(hholds*(percent/100)),
         income_bracket = recode(income_bracket,!!!income_brackets))
brackets

nrow(brackets)
ncol(brackets)
```

We can see that our rows and columns are as expected, and we now have a nice table showing the percent of households of each race in every income bracket in any given year.

### Credit

I found myself stuck on various data cleaning steps in this exercise, as well as figuring out how to use pivot tables to get the data in the way we wanted, so I followed the code from the solutions for this exercise.

Overall, I needed some help wit hexamples of how to apply R functions to clean data and change it into a more useful form, and I will hopefulyl have a much easier time doing so in future challenges and assignments.