challenge_3
tidyverse
readxl
dplyr
tidyr
households
Author

Saaradhaa M

Published

August 17, 2022

Code
library(tidyverse)
library(readxl)
library(dplyr)
library(tidyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Reading in data

I will be working with the households dataset.

Code
# Reading in data.
households <-read_excel("_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx", skip=4)
households
# A tibble: 383 × 16
   ...1      ...2  Total Under…¹ $15,0…² $25,0…³ $35,0…⁴ $50,0…⁵ $75,0…⁶ $100,…⁷
   <chr>     <chr> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ALL RACES <NA>     NA    NA      NA      NA      NA      NA      NA      NA  
 2 2019      1284…   100     9.1     8       8.3    11.7    16.5    12.3    15.5
 3 2018      1285…   100    10.1     8.8     8.7    12      17      12.5    15  
 4 2017 2    1276…   100    10       9.1     9.2    12      16.4    12.4    14.7
 5 2017      1275…   100    10.1     9.1     9.2    11.9    16.3    12.6    14.8
 6 2016      1262…   100    10.4     9       9.2    12.3    16.7    12.2    15  
 7 2015      1258…   100    10.6    10       9.6    12.1    16.1    12.4    14.9
 8 2014      1245…   100    11.4    10.5     9.6    12.6    16.4    12.1    14  
 9 2013 3    1239…   100    11.4    10.3     9.5    12.5    16.8    12      13.9
10 2013 4    1229…   100    11.3    10.4     9.7    13.1    17      12.5    13.6
# … with 373 more rows, 6 more variables: `$150,000\r\nto\r\n$199,999` <dbl>,
#   `$200,000 and over` <dbl>, Estimate...13 <dbl>,
#   `Margin of error1 (±)...14` <dbl>, Estimate...15 <chr>,
#   `Margin of error1 (±)...16` <chr>, and abbreviated variable names
#   ¹​`Under $15,000`, ²​`$15,000\r\nto\r\n$24,999`, ³​`$25,000\r\nto\r\n$34,999`,
#   ⁴​`$35,000\r\nto\r\n$49,999`, ⁵​`$50,000\r\nto\r\n$74,999`,
#   ⁶​`$75,000\r\nto\r\n$99,999`, ⁷​`$100,000\r\nto\r\n$149,999`
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Brief description of data

When reading in the data, I skipped the first four rows (they just describe the dataset). The dataset segments households by their income levels, race and Hispanic origin of householder from 1967 to 2019. It also has an external link to https://www2.census.gov/programs-surveys/cps/techdocs/cpsmar20.pdf, which shows that the data is part of the Annual Social and Economic Supplement in the Census. I need to re-name some headers, remove “Total” (redundant column) and remove rows 353 to 383 (they’re just notes).

Code
# Rename column headers.
colnames(households)
 [1] "...1"                       "...2"                      
 [3] "Total"                      "Under $15,000"             
 [5] "$15,000\r\nto\r\n$24,999"   "$25,000\r\nto\r\n$34,999"  
 [7] "$35,000\r\nto\r\n$49,999"   "$50,000\r\nto\r\n$74,999"  
 [9] "$75,000\r\nto\r\n$99,999"   "$100,000\r\nto\r\n$149,999"
[11] "$150,000\r\nto\r\n$199,999" "$200,000 and over"         
[13] "Estimate...13"              "Margin of error1 (±)...14" 
[15] "Estimate...15"              "Margin of error1 (±)...16" 
Code
households <- rename(households, "year" = "...1", "num_thousands" = "...2", "estimated_median_income" = "Estimate...13", "median_moe" = "Margin of error1 (±)...14", "estimated_mean_income" = "Estimate...15", "mean_moe" = "Margin of error1 (±)...16")
# Remove "Total" column.
households <- households[,-3]
# Remove rows 353-383.
households <- households[-c(353:383),]
# Which rows have missing values? This tells me how many rows of races there are.
which(rowSums(is.na(households))>0)
 [1]   1  57  78 114 135 166 187 208 244 265 286 302

Anticipate end result and find current and future data dimensions.

Now the dataset is a lot cleaner. We can see that in the “year” column, there are rows of races (N = 12). Race should actually be entered as a separate column, but I don’t know how to select specific rows in the “year” column to create a new column. I’ll remove those rows just for the purposes of working through this exercise.

Code
# Remove race rows.
households_new <- households[-c(1,57,78,114,135,166,187,208,244,265,286,302),]
dim(households_new)
[1] 340  15

The current dimensions are 340 rows and 15 columns. I would like to shift all the income categories into an “income” column, so this should give me a lot more rows and 8 columns.

Pivoting dataset

Code
# Attempt pivotlonger().
households_new <- pivot_longer(households_new, cols = contains("$"), names_to = "income", values_to = "proportion")
households_new
# A tibble: 3,060 × 8
   year  num_thousands estimated_median…¹ media…² estim…³ mean_…⁴ income propo…⁵
   <chr> <chr>                      <dbl>   <dbl> <chr>   <chr>   <chr>    <dbl>
 1 2019  128451                     68703     904 98088   1042    "Unde…     9.1
 2 2019  128451                     68703     904 98088   1042    "$15,…     8  
 3 2019  128451                     68703     904 98088   1042    "$25,…     8.3
 4 2019  128451                     68703     904 98088   1042    "$35,…    11.7
 5 2019  128451                     68703     904 98088   1042    "$50,…    16.5
 6 2019  128451                     68703     904 98088   1042    "$75,…    12.3
 7 2019  128451                     68703     904 98088   1042    "$100…    15.5
 8 2019  128451                     68703     904 98088   1042    "$150…     8.3
 9 2019  128451                     68703     904 98088   1042    "$200…    10.3
10 2018  128579                     64324     704 91652   914     "Unde…    10.1
# … with 3,050 more rows, and abbreviated variable names
#   ¹​estimated_median_income, ²​median_moe, ³​estimated_mean_income, ⁴​mean_moe,
#   ⁵​proportion
# ℹ Use `print(n = ...)` to see more rows

Now we suddenly have >3000 rows. This is because the columns estimated_median_income, median_moe, estimated_mean_income and mean_moe are the same for each year (regardless of income bracket, which we’ve just pivoted into a new column). So I’m going to split the data into two tables to make it easier to understand.

Code
# Creating table 1 by removing appropriate columns.
households_1 <- households_new[,-c(3:6)]

# Changing num_thousands to numeric so that the next argument runs properly.
households_1$num_thousands <- as.numeric(households_1$num_thousands)

# Merging 2 columns into 1.
households_1 <- households_1 %>% mutate(count_thousands = `num_thousands`*(`proportion`/100))

# Removing the 2 old columns.
households_1 <- households_1[,-c(2,4)]
Code
# Creating table 2 by removing appropriate columns.
households_2 <- households_new[,-c(2, 7:8)]

# Remove duplicate rows in table 2.
households_2 %>% distinct()
# A tibble: 340 × 5
   year   estimated_median_income median_moe estimated_mean_income mean_moe
   <chr>                    <dbl>      <dbl> <chr>                 <chr>   
 1 2019                     68703        904 98088                 1042    
 2 2018                     64324        704 91652                 914     
 3 2017 2                   63761        552 91406                 979     
 4 2017                     64007        575 89922                 892     
 5 2016                     62898        764 88578                 822     
 6 2015                     60987        570 85533                 715     
 7 2014                     58001        697 81870                 793     
 8 2013 3                   58904       1183 82660                 1201    
 9 2013 4                   57095        499 79852                 902     
10 2012                     56912        384 79510                 773     
# … with 330 more rows
# ℹ Use `print(n = ...)` to see more rows

Now we have 2 tables that are relatively easier to comprehend than what we had at the start. This is a work in progress - I want to figure out how to add the race column, and also to interpret the tables I’ve created.