Challenge 3 Instructions

challenge_3

animal_weights

eggs

australian_marriage

usa_households

sce_labor

Author

Matthew Norberg

Published

September 26, 2022

Code

library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
identify what needs to be done to tidy the current data
anticipate the shape of pivoted data
pivot the data into tidy format using pivot_longer

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

animal_weights.csv ⭐
eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
australian_marriage*.xls ⭐⭐⭐
USA Households*.xlsx ⭐⭐⭐⭐
sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟

Code

mdf <- read_excel('./_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx', skip = 4)
mdf <- select(mdf, -(Total)) %>%
       rename(Number = ...2,
              MedianEstimate = Estimate...13, 
              MeanEstimate = Estimate...15,
              MedianEstimateMOE = `Margin of error1 (±)...14`,
              MeanEstimateMOE = `Margin of error1 (±)...16`) %>%
       head(-31)
head(mdf)

# A tibble: 6 × 15
  ...1    Number Under…¹ $15,0…² $25,0…³ $35,0…⁴ $50,0…⁵ $75,0…⁶ $100,…⁷ $150,…⁸
  <chr>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 ALL RA… <NA>      NA      NA      NA      NA      NA      NA      NA      NA  
2 2019    128451     9.1     8       8.3    11.7    16.5    12.3    15.5     8.3
3 2018    128579    10.1     8.8     8.7    12      17      12.5    15       7.2
4 2017 2  127669    10       9.1     9.2    12      16.4    12.4    14.7     7.3
5 2017    127586    10.1     9.1     9.2    11.9    16.3    12.6    14.8     7.5
6 2016    126224    10.4     9       9.2    12.3    16.7    12.2    15       7.2
# … with 5 more variables: `$200,000 and over` <dbl>, MedianEstimate <dbl>,
#   MedianEstimateMOE <dbl>, MeanEstimate <chr>, MeanEstimateMOE <chr>, and
#   abbreviated variable names ¹`Under $15,000`, ²`$15,000\r\nto\r\n$24,999`,
#   ³`$25,000\r\nto\r\n$34,999`, ⁴`$35,000\r\nto\r\n$49,999`,
#   ⁵`$50,000\r\nto\r\n$74,999`, ⁶`$75,000\r\nto\r\n$99,999`,
#   ⁷`$100,000\r\nto\r\n$149,999`, ⁸`$150,000\r\nto\r\n$199,999`

Briefly describe the data

Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”

The data set looks at total income by race and Hispanic origin of United States Households from 1967 to 2019. The initial representation of the data is unorganized. An arbitrary row describes total income for a given race and Hispanic origin by year. Note that a row represents a population of households, not just one.

The first column of the data set provides the Race and Hispanic Origin as well as the year. This format is readable in an excel sheet, but it is not tidy. The second column describes the total number of households in the row. The next group of columns is used to split the population into a number of different income groups. The value in any of these columns tells us what percent of the population (column 2) falls into that income group. For instance, if there is a value of 10 in the ‘Under $15,000’ column, it means that 10% of that population makes less than $15,000. Finally, the last 2 groups of columns describe the median income and mean income of the group respectively. Additionally, there are columns to describe how good the median income and mean income estimates are.

The first problem with the data set is the leftmost column. In the excel file, it is being used to describe two different parameters: Race/Origin and Year. This is not tidy data because it is describing more than one parameter.

Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting $k-3$ variables into a longer format where the $k-3$ variable names will move into the names_to variable and the current values in each of those columns will move into the values_to variable. Therefore, we would expect $n * (k-3)$ rows in the pivoted dataframe!

Example: find current and future data dimensions

Lets see if this works with a simple example.

Code

df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
           year = rep(c(1980,1990), 3), 
           trade = rep(c("NAFTA", "NAFTA", "EU"),2),
           outgoing = rnorm(6, mean=1000, sd=500),
           incoming = rlogis(6, location=1000, 
                             scale = 400))
df

# A tibble: 6 × 5
  country  year trade outgoing incoming
  <chr>   <dbl> <chr>    <dbl>    <dbl>
1 Mexico   1980 NAFTA     67.1    1312.
2 USA      1990 NAFTA   1131.     1398.
3 France   1980 EU       340.     1734.
4 Mexico   1990 NAFTA     89.4    1869.
5 USA      1980 NAFTA    205.     2140.
6 France   1990 EU      1010.      888.

Code

#existing rows/cases
nrow(df)

[1] 6

Code

#existing columns/cases
ncol(df)

[1] 5

Code

#expected rows/cases
nrow(df) * (ncol(df)-3)

[1] 12

Code

# expected columns 
3 + 2

[1] 5

Or simple example has $n = 6$ rows and $k - 3 = 2$ variables being pivoted, so we expect a new dataframe to have $n * 2 = 12$ rows x $3 + 2 = 5$ columns.

Challenge: Describe the final dimensions

Document your work here.

To make this tidy, I would like to to make one column which describes the race and another column which describes the year. This was done in two steps. First, I cleaned the column to get rid of footnotes and excess information that was present in the excel file. Next, I created a vector containing the race information for each row such that the ith item in the vector corresponded to the ith row of the tibble. I added this vector to the tibble as the Race Column and then removed the values that contain NA to visualize the result. The code is placed below.

Code

# Clean the leftmost column
c1 <- mdf$...1
for (i in seq_along(c1)) {
  item <- c1[[i]]
  # Case where item is a category containing a number -> "WHITE ALONE 24"
  if (!(is.na(slice(mdf, i)$Number)) & str_detect(item, " ")) {
    pieces <- str_split(item, " ")[[1]]
    c1[i] = pieces[1]
  }
  # Case where item is has superscript -> "2017 2"
  else if (is.na(slice(mdf, i)$Number)){
    c1[i] = str_squish(gsub('[0-9]+', '', item))
  }
}
# Reset leftmost column 
mdf$...1 <- c1

# Create a new column describing race information
c2 <- mdf$...1
category <- c2[[1]]
for (i in seq_along(c2)) {
  item <- c2[[i]]
  if (!(str_detect(item, "[0-9][0-9][0-9][0-9]")))
      category <- item
  c2[i] = category
}
# Add new column to tibble, remove old column
mdf <- mutate(mdf, `Race/Hispanic` = c2) %>% 
       relocate(`Race/Hispanic`, .after = `...1`) %>%
       rename(Year = `...1`)
# Remove old rows
mdf <- na.omit(mdf)
head(mdf, n = 15)

# A tibble: 15 × 16
   Year  Race/H…¹ Number Under…² $15,0…³ $25,0…⁴ $35,0…⁵ $50,0…⁶ $75,0…⁷ $100,…⁸
   <chr> <chr>    <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 2019  ALL RAC… 128451     9.1     8       8.3    11.7    16.5    12.3    15.5
 2 2018  ALL RAC… 128579    10.1     8.8     8.7    12      17      12.5    15  
 3 2017  ALL RAC… 127669    10       9.1     9.2    12      16.4    12.4    14.7
 4 2017  ALL RAC… 127586    10.1     9.1     9.2    11.9    16.3    12.6    14.8
 5 2016  ALL RAC… 126224    10.4     9       9.2    12.3    16.7    12.2    15  
 6 2015  ALL RAC… 125819    10.6    10       9.6    12.1    16.1    12.4    14.9
 7 2014  ALL RAC… 124587    11.4    10.5     9.6    12.6    16.4    12.1    14  
 8 2013  ALL RAC… 123931    11.4    10.3     9.5    12.5    16.8    12      13.9
 9 2013  ALL RAC… 122952    11.3    10.4     9.7    13.1    17      12.5    13.6
10 2012  ALL RAC… 122459    11.4    10.6    10.1    12.5    17.4    12      13.9
11 2011  ALL RAC… 121084    11.6    10.2    10.2    13.1    17.2    11.9    13.8
12 2010  ALL RAC… 119927    11.2    10.7     9.4    13.3    16.8    12.4    14.1
13 2009  ALL RAC… 117538    10.4    10       9.7    13.2    17.4    12.4    14.5
14 2008  ALL RAC… 117181    10.4    10       9.4    13.4    17      12.5    15  
15 2007  ALL RAC… 116783    10      10       9      12.8    17.3    12.6    15.3
# … with 6 more variables: `$150,000\r\nto\r\n$199,999` <dbl>,
#   `$200,000 and over` <dbl>, MedianEstimate <dbl>, MedianEstimateMOE <dbl>,
#   MeanEstimate <chr>, MeanEstimateMOE <chr>, and abbreviated variable names
#   ¹`Race/Hispanic`, ²`Under $15,000`, ³`$15,000\r\nto\r\n$24,999`,
#   ⁴`$25,000\r\nto\r\n$34,999`, ⁵`$35,000\r\nto\r\n$49,999`,
#   ⁶`$50,000\r\nto\r\n$74,999`, ⁷`$75,000\r\nto\r\n$99,999`,
#   ⁸`$100,000\r\nto\r\n$149,999`

Any additional comments?

I like the way that the data frame is structured by the code above and I would argue that it does not need to be pivoted any further. A case is made up by the pairing of Year and Race/Hispanic columns. The view of the data created above clearly shows a case on each row which makes it easy to look at the data. At this point, if you pivot the data to make it longer, you get a lot of repetition across rows. If you pivot the table to make it wider, you end up with many values that are not available.

If I were to rearrange the tibble, I would probably sort the rows by Year so that you can compare different Race/Hispanic values across multiple years. However, I do think there are a few different ways you can pivot the table to view it differently (although I don’t think the views are as organized due to data repetition). These alternate views are presented below in the pivot the data section.

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Example

Code

df<-pivot_longer(df, col = c(outgoing, incoming),
                 names_to="trade_direction",
                 values_to = "trade_value")
df

# A tibble: 12 × 5
   country  year trade trade_direction trade_value
   <chr>   <dbl> <chr> <chr>                 <dbl>
 1 Mexico   1980 NAFTA outgoing               67.1
 2 Mexico   1980 NAFTA incoming             1312. 
 3 USA      1990 NAFTA outgoing             1131. 
 4 USA      1990 NAFTA incoming             1398. 
 5 France   1980 EU    outgoing              340. 
 6 France   1980 EU    incoming             1734. 
 7 Mexico   1990 NAFTA outgoing               89.4
 8 Mexico   1990 NAFTA incoming             1869. 
 9 USA      1980 NAFTA outgoing              205. 
10 USA      1980 NAFTA incoming             2140. 
11 France   1990 EU    outgoing             1010. 
12 France   1990 EU    incoming              888.

Yes, once it is pivoted long, our resulting data are $12x5$ - exactly what we expected!

Challenge: Pivot the Chosen Data

Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?

One interesting way to rearrange the data would be to combine the “MedianEstimate”, “MeanEstimate”, “MedianEstimateMOE”, and “MeanEstimateMOE” columns into one to make the tibble longer. Before we can do that, we need to convert the “MeanEstimate” and “MeanEstimateMOE” columns to have a type. Currently, they are character types because there are a few entries in those columns that have a value of “N” for some reason. The code to do that is placed below.

Code

# Remove "N"
c1 = mdf$MeanEstimate
for (i in seq_along(c1)) {
  item <- c1[[i]]
  if (item == "N")
    c1[i] = "-1"
}

c2 = mdf$MeanEstimateMOE
for (i in seq_along(c2)) {
  item <- c2[[i]]
  if (item == "N")
    c2[i] = "-1"
}

test <- mdf
test$MeanEstimate = c1
test$MeanEstimateMOE = c2
test <- type_convert(test)

Now we can pivot the tibble to make it longer. As a sanity check, a case would be 4 variables. There are currently 340 rows in the data frame. Therefore, when we pivot the tibble, we expect there to be 340 * 4 = 1,360 rows in the resulting data set. This is due to the fact that there will be four more rows for each row that was in the dataset before pivoting.

Code

pivotDF <- pivot_longer(test, cols = c("MedianEstimate", "MeanEstimate", "MedianEstimateMOE", "MeanEstimateMOE"), names_to = "EstimateType", values_to = "EstimateValue")
head(pivotDF)

# A tibble: 6 × 14
   Year Race/Hi…¹ Number Under…² $15,0…³ $25,0…⁴ $35,0…⁵ $50,0…⁶ $75,0…⁷ $100,…⁸
  <dbl> <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1  2019 ALL RACES 128451     9.1     8       8.3    11.7    16.5    12.3    15.5
2  2019 ALL RACES 128451     9.1     8       8.3    11.7    16.5    12.3    15.5
3  2019 ALL RACES 128451     9.1     8       8.3    11.7    16.5    12.3    15.5
4  2019 ALL RACES 128451     9.1     8       8.3    11.7    16.5    12.3    15.5
5  2018 ALL RACES 128579    10.1     8.8     8.7    12      17      12.5    15  
6  2018 ALL RACES 128579    10.1     8.8     8.7    12      17      12.5    15  
# … with 4 more variables: `$150,000\r\nto\r\n$199,999` <dbl>,
#   `$200,000 and over` <dbl>, EstimateType <chr>, EstimateValue <dbl>, and
#   abbreviated variable names ¹`Race/Hispanic`, ²`Under $15,000`,
#   ³`$15,000\r\nto\r\n$24,999`, ⁴`$25,000\r\nto\r\n$34,999`,
#   ⁵`$35,000\r\nto\r\n$49,999`, ⁶`$50,000\r\nto\r\n$74,999`,
#   ⁷`$75,000\r\nto\r\n$99,999`, ⁸`$100,000\r\nto\r\n$149,999`

One more view that is interesting occurs when you pivot the columns that describe the percent of the population that falls into that income group. The code to do that is provided below.

Code

pivotDF <- pivot_longer(mdf, cols = 4:12, names_to = "IncomeType", values_to = "Percentage")
head(pivotDF)

# A tibble: 6 × 9
  Year  `Race/Hispanic` Number MedianE…¹ Media…² MeanE…³ MeanE…⁴ Incom…⁵ Perce…⁶
  <chr> <chr>           <chr>      <dbl>   <dbl> <chr>   <chr>   <chr>     <dbl>
1 2019  ALL RACES       128451     68703     904 98088   1042    "Under…     9.1
2 2019  ALL RACES       128451     68703     904 98088   1042    "$15,0…     8  
3 2019  ALL RACES       128451     68703     904 98088   1042    "$25,0…     8.3
4 2019  ALL RACES       128451     68703     904 98088   1042    "$35,0…    11.7
5 2019  ALL RACES       128451     68703     904 98088   1042    "$50,0…    16.5
6 2019  ALL RACES       128451     68703     904 98088   1042    "$75,0…    12.3
# … with abbreviated variable names ¹MedianEstimate, ²MedianEstimateMOE,
#   ³MeanEstimate, ⁴MeanEstimateMOE, ⁵IncomeType, ⁶Percentage

Any additional comments?