Code
library(tidyverse)
library(readxl)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Matthew Norberg
September 26, 2022
Today’s challenge is to:
pivot_longer
Read in one (or more) of the following datasets, using the correct R package and command.
mdf <- read_excel('./_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx', skip = 4)
mdf <- select(mdf, -(Total)) %>%
rename(Number = ...2,
MedianEstimate = Estimate...13,
MeanEstimate = Estimate...15,
MedianEstimateMOE = `Margin of error1 (±)...14`,
MeanEstimateMOE = `Margin of error1 (±)...16`) %>%
head(-31)
head(mdf)
# A tibble: 6 × 15
...1 Number Under…¹ $15,0…² $25,0…³ $35,0…⁴ $50,0…⁵ $75,0…⁶ $100,…⁷ $150,…⁸
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ALL RA… <NA> NA NA NA NA NA NA NA NA
2 2019 128451 9.1 8 8.3 11.7 16.5 12.3 15.5 8.3
3 2018 128579 10.1 8.8 8.7 12 17 12.5 15 7.2
4 2017 2 127669 10 9.1 9.2 12 16.4 12.4 14.7 7.3
5 2017 127586 10.1 9.1 9.2 11.9 16.3 12.6 14.8 7.5
6 2016 126224 10.4 9 9.2 12.3 16.7 12.2 15 7.2
# … with 5 more variables: `$200,000 and over` <dbl>, MedianEstimate <dbl>,
# MedianEstimateMOE <dbl>, MeanEstimate <chr>, MeanEstimateMOE <chr>, and
# abbreviated variable names ¹`Under $15,000`, ²`$15,000\r\nto\r\n$24,999`,
# ³`$25,000\r\nto\r\n$34,999`, ⁴`$35,000\r\nto\r\n$49,999`,
# ⁵`$50,000\r\nto\r\n$74,999`, ⁶`$75,000\r\nto\r\n$99,999`,
# ⁷`$100,000\r\nto\r\n$149,999`, ⁸`$150,000\r\nto\r\n$199,999`
Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”
The data set looks at total income by race and Hispanic origin of United States Households from 1967 to 2019. The initial representation of the data is unorganized. An arbitrary row describes total income for a given race and Hispanic origin by year. Note that a row represents a population of households, not just one.
The first column of the data set provides the Race and Hispanic Origin as well as the year. This format is readable in an excel sheet, but it is not tidy. The second column describes the total number of households in the row. The next group of columns is used to split the population into a number of different income groups. The value in any of these columns tells us what percent of the population (column 2) falls into that income group. For instance, if there is a value of 10 in the ‘Under $15,000’ column, it means that 10% of that population makes less than $15,000. Finally, the last 2 groups of columns describe the median income and mean income of the group respectively. Additionally, there are columns to describe how good the median income and mean income estimates are.
The first problem with the data set is the leftmost column. In the excel file, it is being used to describe two different parameters: Race/Origin and Year. This is not tidy data because it is describing more than one parameter.
The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.
One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.
Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting \(k-3\) variables into a longer format where the \(k-3\) variable names will move into the names_to
variable and the current values in each of those columns will move into the values_to
variable. Therefore, we would expect \(n * (k-3)\) rows in the pivoted dataframe!
Lets see if this works with a simple example.
# A tibble: 6 × 5
country year trade outgoing incoming
<chr> <dbl> <chr> <dbl> <dbl>
1 Mexico 1980 NAFTA 67.1 1312.
2 USA 1990 NAFTA 1131. 1398.
3 France 1980 EU 340. 1734.
4 Mexico 1990 NAFTA 89.4 1869.
5 USA 1980 NAFTA 205. 2140.
6 France 1990 EU 1010. 888.
[1] 6
[1] 5
[1] 12
[1] 5
Or simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.
Document your work here.
To make this tidy, I would like to to make one column which describes the race and another column which describes the year. This was done in two steps. First, I cleaned the column to get rid of footnotes and excess information that was present in the excel file. Next, I created a vector containing the race information for each row such that the ith item in the vector corresponded to the ith row of the tibble. I added this vector to the tibble as the Race Column and then removed the values that contain NA to visualize the result. The code is placed below.
# Clean the leftmost column
c1 <- mdf$...1
for (i in seq_along(c1)) {
item <- c1[[i]]
# Case where item is a category containing a number -> "WHITE ALONE 24"
if (!(is.na(slice(mdf, i)$Number)) & str_detect(item, " ")) {
pieces <- str_split(item, " ")[[1]]
c1[i] = pieces[1]
}
# Case where item is has superscript -> "2017 2"
else if (is.na(slice(mdf, i)$Number)){
c1[i] = str_squish(gsub('[0-9]+', '', item))
}
}
# Reset leftmost column
mdf$...1 <- c1
# Create a new column describing race information
c2 <- mdf$...1
category <- c2[[1]]
for (i in seq_along(c2)) {
item <- c2[[i]]
if (!(str_detect(item, "[0-9][0-9][0-9][0-9]")))
category <- item
c2[i] = category
}
# Add new column to tibble, remove old column
mdf <- mutate(mdf, `Race/Hispanic` = c2) %>%
relocate(`Race/Hispanic`, .after = `...1`) %>%
rename(Year = `...1`)
# Remove old rows
mdf <- na.omit(mdf)
head(mdf, n = 15)
# A tibble: 15 × 16
Year Race/H…¹ Number Under…² $15,0…³ $25,0…⁴ $35,0…⁵ $50,0…⁶ $75,0…⁷ $100,…⁸
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019 ALL RAC… 128451 9.1 8 8.3 11.7 16.5 12.3 15.5
2 2018 ALL RAC… 128579 10.1 8.8 8.7 12 17 12.5 15
3 2017 ALL RAC… 127669 10 9.1 9.2 12 16.4 12.4 14.7
4 2017 ALL RAC… 127586 10.1 9.1 9.2 11.9 16.3 12.6 14.8
5 2016 ALL RAC… 126224 10.4 9 9.2 12.3 16.7 12.2 15
6 2015 ALL RAC… 125819 10.6 10 9.6 12.1 16.1 12.4 14.9
7 2014 ALL RAC… 124587 11.4 10.5 9.6 12.6 16.4 12.1 14
8 2013 ALL RAC… 123931 11.4 10.3 9.5 12.5 16.8 12 13.9
9 2013 ALL RAC… 122952 11.3 10.4 9.7 13.1 17 12.5 13.6
10 2012 ALL RAC… 122459 11.4 10.6 10.1 12.5 17.4 12 13.9
11 2011 ALL RAC… 121084 11.6 10.2 10.2 13.1 17.2 11.9 13.8
12 2010 ALL RAC… 119927 11.2 10.7 9.4 13.3 16.8 12.4 14.1
13 2009 ALL RAC… 117538 10.4 10 9.7 13.2 17.4 12.4 14.5
14 2008 ALL RAC… 117181 10.4 10 9.4 13.4 17 12.5 15
15 2007 ALL RAC… 116783 10 10 9 12.8 17.3 12.6 15.3
# … with 6 more variables: `$150,000\r\nto\r\n$199,999` <dbl>,
# `$200,000 and over` <dbl>, MedianEstimate <dbl>, MedianEstimateMOE <dbl>,
# MeanEstimate <chr>, MeanEstimateMOE <chr>, and abbreviated variable names
# ¹`Race/Hispanic`, ²`Under $15,000`, ³`$15,000\r\nto\r\n$24,999`,
# ⁴`$25,000\r\nto\r\n$34,999`, ⁵`$35,000\r\nto\r\n$49,999`,
# ⁶`$50,000\r\nto\r\n$74,999`, ⁷`$75,000\r\nto\r\n$99,999`,
# ⁸`$100,000\r\nto\r\n$149,999`
Any additional comments?
I like the way that the data frame is structured by the code above and I would argue that it does not need to be pivoted any further. A case is made up by the pairing of Year and Race/Hispanic columns. The view of the data created above clearly shows a case on each row which makes it easy to look at the data. At this point, if you pivot the data to make it longer, you get a lot of repetition across rows. If you pivot the table to make it wider, you end up with many values that are not available.
If I were to rearrange the tibble, I would probably sort the rows by Year so that you can compare different Race/Hispanic values across multiple years. However, I do think there are a few different ways you can pivot the table to view it differently (although I don’t think the views are as organized due to data repetition). These alternate views are presented below in the pivot the data section.
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.
# A tibble: 12 × 5
country year trade trade_direction trade_value
<chr> <dbl> <chr> <chr> <dbl>
1 Mexico 1980 NAFTA outgoing 67.1
2 Mexico 1980 NAFTA incoming 1312.
3 USA 1990 NAFTA outgoing 1131.
4 USA 1990 NAFTA incoming 1398.
5 France 1980 EU outgoing 340.
6 France 1980 EU incoming 1734.
7 Mexico 1990 NAFTA outgoing 89.4
8 Mexico 1990 NAFTA incoming 1869.
9 USA 1980 NAFTA outgoing 205.
10 USA 1980 NAFTA incoming 2140.
11 France 1990 EU outgoing 1010.
12 France 1990 EU incoming 888.
Yes, once it is pivoted long, our resulting data are \(12x5\) - exactly what we expected!
Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?
One interesting way to rearrange the data would be to combine the “MedianEstimate”, “MeanEstimate”, “MedianEstimateMOE”, and “MeanEstimateMOE” columns into one to make the tibble longer. Before we can do that, we need to convert the “MeanEstimate” and “MeanEstimateMOE” columns to have a
Now we can pivot the tibble to make it longer. As a sanity check, a case would be 4 variables. There are currently 340 rows in the data frame. Therefore, when we pivot the tibble, we expect there to be 340 * 4 = 1,360 rows in the resulting data set. This is due to the fact that there will be four more rows for each row that was in the dataset before pivoting.
# A tibble: 6 × 14
Year Race/Hi…¹ Number Under…² $15,0…³ $25,0…⁴ $35,0…⁵ $50,0…⁶ $75,0…⁷ $100,…⁸
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019 ALL RACES 128451 9.1 8 8.3 11.7 16.5 12.3 15.5
2 2019 ALL RACES 128451 9.1 8 8.3 11.7 16.5 12.3 15.5
3 2019 ALL RACES 128451 9.1 8 8.3 11.7 16.5 12.3 15.5
4 2019 ALL RACES 128451 9.1 8 8.3 11.7 16.5 12.3 15.5
5 2018 ALL RACES 128579 10.1 8.8 8.7 12 17 12.5 15
6 2018 ALL RACES 128579 10.1 8.8 8.7 12 17 12.5 15
# … with 4 more variables: `$150,000\r\nto\r\n$199,999` <dbl>,
# `$200,000 and over` <dbl>, EstimateType <chr>, EstimateValue <dbl>, and
# abbreviated variable names ¹`Race/Hispanic`, ²`Under $15,000`,
# ³`$15,000\r\nto\r\n$24,999`, ⁴`$25,000\r\nto\r\n$34,999`,
# ⁵`$35,000\r\nto\r\n$49,999`, ⁶`$50,000\r\nto\r\n$74,999`,
# ⁷`$75,000\r\nto\r\n$99,999`, ⁸`$100,000\r\nto\r\n$149,999`
One more view that is interesting occurs when you pivot the columns that describe the percent of the population that falls into that income group. The code to do that is provided below.
# A tibble: 6 × 9
Year `Race/Hispanic` Number MedianE…¹ Media…² MeanE…³ MeanE…⁴ Incom…⁵ Perce…⁶
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
1 2019 ALL RACES 128451 68703 904 98088 1042 "Under… 9.1
2 2019 ALL RACES 128451 68703 904 98088 1042 "$15,0… 8
3 2019 ALL RACES 128451 68703 904 98088 1042 "$25,0… 8.3
4 2019 ALL RACES 128451 68703 904 98088 1042 "$35,0… 11.7
5 2019 ALL RACES 128451 68703 904 98088 1042 "$50,0… 16.5
6 2019 ALL RACES 128451 68703 904 98088 1042 "$75,0… 12.3
# … with abbreviated variable names ¹MedianEstimate, ²MedianEstimateMOE,
# ³MeanEstimate, ⁴MeanEstimateMOE, ⁵IncomeType, ⁶Percentage
Any additional comments?
---
title: "Challenge 3 Instructions"
author: "Matthew Norberg"
desription: "Tidy Data: Pivoting"
date: "09/26/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- animal_weights
- eggs
- australian_marriage
- usa_households
- sce_labor
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2. identify what needs to be done to tidy the current data
3. anticipate the shape of pivoted data
4. pivot the data into tidy format using `pivot_longer`
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- animal_weights.csv ⭐
- eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
- australian_marriage\*.xls ⭐⭐⭐
- USA Households\*.xlsx ⭐⭐⭐⭐
- sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟
```{r}
mdf <- read_excel('./_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx', skip = 4)
mdf <- select(mdf, -(Total)) %>%
rename(Number = ...2,
MedianEstimate = Estimate...13,
MeanEstimate = Estimate...15,
MedianEstimateMOE = `Margin of error1 (±)...14`,
MeanEstimateMOE = `Margin of error1 (±)...16`) %>%
head(-31)
head(mdf)
```
### Briefly describe the data
Describe the data, and be sure to comment on why you are planning to pivot it to make it "tidy"
The data set looks at total income by race and Hispanic origin of United States Households from 1967 to 2019. The initial representation of the data is unorganized. An arbitrary row describes total income for a given race and Hispanic origin by year. Note that a row represents a population of households, not just one.
The first column of the data set provides the Race and Hispanic Origin as well as the year. This format is readable in an excel sheet, but it is not tidy. The second column describes the total number of households in the row. The next group of columns is used to split the population into a number of different income groups. The value in any of these columns tells us what percent of the population (column 2) falls into that income group. For instance, if there is a value of 10 in the 'Under \$15,000' column, it means that 10% of that population makes less than \$15,000. Finally, the last 2 groups of columns describe the median income and mean income of the group respectively. Additionally, there are columns to describe how good the median income and mean income estimates are.
The first problem with the data set is the leftmost column. In the excel file, it is being used to describe two different parameters: Race/Origin and Year. This is not tidy data because it is describing more than one parameter.
## Anticipate the End Result
The first step in pivoting the data is to try to come up with a concrete vision of what the end product *should* look like - that way you will know whether or not your pivoting was successful.
One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.
Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting $k-3$ variables into a longer format where the $k-3$ variable names will move into the `names_to` variable and the current values in each of those columns will move into the `values_to` variable. Therefore, we would expect $n * (k-3)$ rows in the pivoted dataframe!
### Example: find current and future data dimensions
Lets see if this works with a simple example.
```{r}
#| tbl-cap: Example
df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
year = rep(c(1980,1990), 3),
trade = rep(c("NAFTA", "NAFTA", "EU"),2),
outgoing = rnorm(6, mean=1000, sd=500),
incoming = rlogis(6, location=1000,
scale = 400))
df
#existing rows/cases
nrow(df)
#existing columns/cases
ncol(df)
#expected rows/cases
nrow(df) * (ncol(df)-3)
# expected columns
3 + 2
```
Or simple example has $n = 6$ rows and $k - 3 = 2$ variables being pivoted, so we expect a new dataframe to have $n * 2 = 12$ rows x $3 + 2 = 5$ columns.
### Challenge: Describe the final dimensions
Document your work here.
To make this tidy, I would like to to make one column which describes the race and another column which describes the year. This was done in two steps. First, I cleaned the column to get rid of footnotes and excess information that was present in the excel file. Next, I created a vector containing the race information for each row such that the ith item in the vector corresponded to the ith row of the tibble. I added this vector to the tibble as the Race Column and then removed the values that contain NA to visualize the result. The code is placed below.
```{r}
# Clean the leftmost column
c1 <- mdf$...1
for (i in seq_along(c1)) {
item <- c1[[i]]
# Case where item is a category containing a number -> "WHITE ALONE 24"
if (!(is.na(slice(mdf, i)$Number)) & str_detect(item, " ")) {
pieces <- str_split(item, " ")[[1]]
c1[i] = pieces[1]
}
# Case where item is has superscript -> "2017 2"
else if (is.na(slice(mdf, i)$Number)){
c1[i] = str_squish(gsub('[0-9]+', '', item))
}
}
# Reset leftmost column
mdf$...1 <- c1
# Create a new column describing race information
c2 <- mdf$...1
category <- c2[[1]]
for (i in seq_along(c2)) {
item <- c2[[i]]
if (!(str_detect(item, "[0-9][0-9][0-9][0-9]")))
category <- item
c2[i] = category
}
# Add new column to tibble, remove old column
mdf <- mutate(mdf, `Race/Hispanic` = c2) %>%
relocate(`Race/Hispanic`, .after = `...1`) %>%
rename(Year = `...1`)
# Remove old rows
mdf <- na.omit(mdf)
head(mdf, n = 15)
```
Any additional comments?
I like the way that the data frame is structured by the code above and I would argue that it does not need to be pivoted any further. A case is made up by the pairing of Year and Race/Hispanic columns. The view of the data created above clearly shows a case on each row which makes it easy to look at the data. At this point, if you pivot the data to make it longer, you get a lot of repetition across rows. If you pivot the table to make it wider, you end up with many values that are not available.
If I were to rearrange the tibble, I would probably sort the rows by Year so that you can compare different Race/Hispanic values across multiple years. However, I do think there are a few different ways you can pivot the table to view it differently (although I don't think the views are as organized due to data repetition). These alternate views are presented below in the pivot the data section.
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
### Example
```{r}
#| tbl-cap: Pivoted Example
df<-pivot_longer(df, col = c(outgoing, incoming),
names_to="trade_direction",
values_to = "trade_value")
df
```
Yes, once it is pivoted long, our resulting data are $12x5$ - exactly what we expected!
### Challenge: Pivot the Chosen Data
Document your work here. What will a new "case" be once you have pivoted the data? How does it meet requirements for tidy data?
One interesting way to rearrange the data would be to combine the "MedianEstimate", "MeanEstimate", "MedianEstimateMOE", and "MeanEstimateMOE" columns into one to make the tibble longer. Before we can do that, we need to convert the "MeanEstimate" and "MeanEstimateMOE" columns to have a <double> type. Currently, they are character types because there are a few entries in those columns that have a value of "N" for some reason. The code to do that is placed below.
```{r}
# Remove "N"
c1 = mdf$MeanEstimate
for (i in seq_along(c1)) {
item <- c1[[i]]
if (item == "N")
c1[i] = "-1"
}
c2 = mdf$MeanEstimateMOE
for (i in seq_along(c2)) {
item <- c2[[i]]
if (item == "N")
c2[i] = "-1"
}
test <- mdf
test$MeanEstimate = c1
test$MeanEstimateMOE = c2
test <- type_convert(test)
```
Now we can pivot the tibble to make it longer. As a sanity check, a case would be 4 variables. There are currently 340 rows in the data frame. Therefore, when we pivot the tibble, we expect there to be 340 * 4 = 1,360 rows in the resulting data set. This is due to the fact that there will be four more rows for each row that was in the dataset before pivoting.
```{r}
pivotDF <- pivot_longer(test, cols = c("MedianEstimate", "MeanEstimate", "MedianEstimateMOE", "MeanEstimateMOE"), names_to = "EstimateType", values_to = "EstimateValue")
head(pivotDF)
```
One more view that is interesting occurs when you pivot the columns that describe the percent of the population that falls into that income group. The code to do that is provided below.
```{r}
pivotDF <- pivot_longer(mdf, cols = 4:12, names_to = "IncomeType", values_to = "Percentage")
head(pivotDF)
```
Any additional comments?