Code
library(tidyverse)
library(readxl)
library(dplyr)
library(tidyr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Saaradhaa M
August 17, 2022
I will be working with the households dataset.
# A tibble: 383 × 16
...1 ...2 Total Under…¹ $15,0…² $25,0…³ $35,0…⁴ $50,0…⁵ $75,0…⁶ $100,…⁷
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ALL RACES <NA> NA NA NA NA NA NA NA NA
2 2019 1284… 100 9.1 8 8.3 11.7 16.5 12.3 15.5
3 2018 1285… 100 10.1 8.8 8.7 12 17 12.5 15
4 2017 2 1276… 100 10 9.1 9.2 12 16.4 12.4 14.7
5 2017 1275… 100 10.1 9.1 9.2 11.9 16.3 12.6 14.8
6 2016 1262… 100 10.4 9 9.2 12.3 16.7 12.2 15
7 2015 1258… 100 10.6 10 9.6 12.1 16.1 12.4 14.9
8 2014 1245… 100 11.4 10.5 9.6 12.6 16.4 12.1 14
9 2013 3 1239… 100 11.4 10.3 9.5 12.5 16.8 12 13.9
10 2013 4 1229… 100 11.3 10.4 9.7 13.1 17 12.5 13.6
# … with 373 more rows, 6 more variables: `$150,000\r\nto\r\n$199,999` <dbl>,
# `$200,000 and over` <dbl>, Estimate...13 <dbl>,
# `Margin of error1 (±)...14` <dbl>, Estimate...15 <chr>,
# `Margin of error1 (±)...16` <chr>, and abbreviated variable names
# ¹`Under $15,000`, ²`$15,000\r\nto\r\n$24,999`, ³`$25,000\r\nto\r\n$34,999`,
# ⁴`$35,000\r\nto\r\n$49,999`, ⁵`$50,000\r\nto\r\n$74,999`,
# ⁶`$75,000\r\nto\r\n$99,999`, ⁷`$100,000\r\nto\r\n$149,999`
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
When reading in the data, I skipped the first four rows (they just describe the dataset). The dataset segments households by their income levels, race and Hispanic origin of householder from 1967 to 2019. It also has an external link to https://www2.census.gov/programs-surveys/cps/techdocs/cpsmar20.pdf, which shows that the data is part of the Annual Social and Economic Supplement in the Census. I need to re-name some headers, remove “Total” (redundant column) and remove rows 353 to 383 (they’re just notes).
[1] "...1" "...2"
[3] "Total" "Under $15,000"
[5] "$15,000\r\nto\r\n$24,999" "$25,000\r\nto\r\n$34,999"
[7] "$35,000\r\nto\r\n$49,999" "$50,000\r\nto\r\n$74,999"
[9] "$75,000\r\nto\r\n$99,999" "$100,000\r\nto\r\n$149,999"
[11] "$150,000\r\nto\r\n$199,999" "$200,000 and over"
[13] "Estimate...13" "Margin of error1 (±)...14"
[15] "Estimate...15" "Margin of error1 (±)...16"
households <- rename(households, "year" = "...1", "num_thousands" = "...2", "estimated_median_income" = "Estimate...13", "median_moe" = "Margin of error1 (±)...14", "estimated_mean_income" = "Estimate...15", "mean_moe" = "Margin of error1 (±)...16")
# Remove "Total" column.
households <- households[,-3]
# Remove rows 353-383.
households <- households[-c(353:383),]
# Which rows have missing values? This tells me how many rows of races there are.
which(rowSums(is.na(households))>0)
[1] 1 57 78 114 135 166 187 208 244 265 286 302
Now the dataset is a lot cleaner. We can see that in the “year” column, there are rows of races (N = 12). Race should actually be entered as a separate column, but I don’t know how to select specific rows in the “year” column to create a new column. I’ll remove those rows just for the purposes of working through this exercise.
[1] 340 15
The current dimensions are 340 rows and 15 columns. I would like to shift all the income categories into an “income” column, so this should give me a lot more rows and 8 columns.
# A tibble: 3,060 × 8
year num_thousands estimated_median…¹ media…² estim…³ mean_…⁴ income propo…⁵
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
1 2019 128451 68703 904 98088 1042 "Unde… 9.1
2 2019 128451 68703 904 98088 1042 "$15,… 8
3 2019 128451 68703 904 98088 1042 "$25,… 8.3
4 2019 128451 68703 904 98088 1042 "$35,… 11.7
5 2019 128451 68703 904 98088 1042 "$50,… 16.5
6 2019 128451 68703 904 98088 1042 "$75,… 12.3
7 2019 128451 68703 904 98088 1042 "$100… 15.5
8 2019 128451 68703 904 98088 1042 "$150… 8.3
9 2019 128451 68703 904 98088 1042 "$200… 10.3
10 2018 128579 64324 704 91652 914 "Unde… 10.1
# … with 3,050 more rows, and abbreviated variable names
# ¹estimated_median_income, ²median_moe, ³estimated_mean_income, ⁴mean_moe,
# ⁵proportion
# ℹ Use `print(n = ...)` to see more rows
Now we suddenly have >3000 rows. This is because the columns estimated_median_income, median_moe, estimated_mean_income and mean_moe are the same for each year (regardless of income bracket, which we’ve just pivoted into a new column). So I’m going to split the data into two tables to make it easier to understand.
# Creating table 1 by removing appropriate columns.
households_1 <- households_new[,-c(3:6)]
# Changing num_thousands to numeric so that the next argument runs properly.
households_1$num_thousands <- as.numeric(households_1$num_thousands)
# Merging 2 columns into 1.
households_1 <- households_1 %>% mutate(count_thousands = `num_thousands`*(`proportion`/100))
# Removing the 2 old columns.
households_1 <- households_1[,-c(2,4)]
# A tibble: 340 × 5
year estimated_median_income median_moe estimated_mean_income mean_moe
<chr> <dbl> <dbl> <chr> <chr>
1 2019 68703 904 98088 1042
2 2018 64324 704 91652 914
3 2017 2 63761 552 91406 979
4 2017 64007 575 89922 892
5 2016 62898 764 88578 822
6 2015 60987 570 85533 715
7 2014 58001 697 81870 793
8 2013 3 58904 1183 82660 1201
9 2013 4 57095 499 79852 902
10 2012 56912 384 79510 773
# … with 330 more rows
# ℹ Use `print(n = ...)` to see more rows
Now we have 2 tables that are relatively easier to comprehend than what we had at the start. This is a work in progress - I want to figure out how to add the race column, and also to interpret the tables I’ve created.
---
title: "Challenge 3"
author: "Saaradhaa M"
desription: "Tidy Data: Pivoting"
date: "08/17/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- tidyverse
- readxl
- dplyr
- tidyr
- households
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
library(dplyr)
library(tidyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Reading in data
I will be working with the households dataset.
```{r}
#| label: read in data
# Reading in data.
households <-read_excel("_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx", skip=4)
households
```
### Brief description of data
When reading in the data, I skipped the first four rows (they just describe the dataset). The dataset segments households by their income levels, race and Hispanic origin of householder from 1967 to 2019. It also has an external link to https://www2.census.gov/programs-surveys/cps/techdocs/cpsmar20.pdf, which shows that the data is part of the Annual Social and Economic Supplement in the Census. I need to re-name some headers, remove "Total" (redundant column) and remove rows 353 to 383 (they're just notes).
```{r}
#| label: cleaning
# Rename column headers.
colnames(households)
households <- rename(households, "year" = "...1", "num_thousands" = "...2", "estimated_median_income" = "Estimate...13", "median_moe" = "Margin of error1 (±)...14", "estimated_mean_income" = "Estimate...15", "mean_moe" = "Margin of error1 (±)...16")
# Remove "Total" column.
households <- households[,-3]
# Remove rows 353-383.
households <- households[-c(353:383),]
# Which rows have missing values? This tells me how many rows of races there are.
which(rowSums(is.na(households))>0)
```
## Anticipate end result and find current and future data dimensions.
Now the dataset is a lot cleaner. We can see that in the "year" column, there are rows of races (*N* = 12). Race should actually be entered as a separate column, but **I don't know how to select specific rows in the "year" column to create a new column**. I'll remove those rows just for the purposes of working through this exercise.
```{r}
#| label: cleaning 2
# Remove race rows.
households_new <- households[-c(1,57,78,114,135,166,187,208,244,265,286,302),]
dim(households_new)
```
The current dimensions are 340 rows and 15 columns. I would like to shift all the income categories into an "income" column, so this should give me a lot more rows and 8 columns.
### Pivoting dataset
```{r}
#| label: pivot
# Attempt pivotlonger().
households_new <- pivot_longer(households_new, cols = contains("$"), names_to = "income", values_to = "proportion")
households_new
```
Now we suddenly have \>3000 rows. This is because the columns estimated_median_income, median_moe, estimated_mean_income and mean_moe are the same for each year (regardless of income bracket, which we've just pivoted into a new column). So I'm going to split the data into two tables to make it easier to understand.
```{r}
# Creating table 1 by removing appropriate columns.
households_1 <- households_new[,-c(3:6)]
# Changing num_thousands to numeric so that the next argument runs properly.
households_1$num_thousands <- as.numeric(households_1$num_thousands)
# Merging 2 columns into 1.
households_1 <- households_1 %>% mutate(count_thousands = `num_thousands`*(`proportion`/100))
# Removing the 2 old columns.
households_1 <- households_1[,-c(2,4)]
```
```{r}
# Creating table 2 by removing appropriate columns.
households_2 <- households_new[,-c(2, 7:8)]
# Remove duplicate rows in table 2.
households_2 %>% distinct()
```
Now we have 2 tables that are relatively easier to comprehend than what we had at the start. This is a work in progress - I want to figure out how to add the race column, and also to interpret the tables I've created.