Challenge 3 Australian Marriage Pivot

challenge_3

australian_marriage

Tidy Data: Pivoting

Author

Kekai Liu

Published

March 7, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
identify what needs to be done to tidy the current data
anticipate the shape of pivoted data
pivot the data into tidy format using pivot_longer

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

animal_weights.csv ⭐
eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
australian_marriage*.xls ⭐⭐⭐
USA Households*.xlsx ⭐⭐⭐⭐
sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟

The data is from the 2017 Australian Marriage Law Postal Survey. It reports the number of yeses and nos from survey respondents at the district level. The data is currently in a wide format and not “tidy.” Tidy data requires that every column variable, but there are three count and three percent variables. Pivoting is required to transform the data from wide to long format, by collapsing the count and percent variables into a single count and a single percent variable with the type of response serving as an ID number. The data also has a header and footnotes which need to be removed.

Code

aus_marriage <- readxl::read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet="Table 2", skip=7)

aus_marriage <- aus_marriage[1:172,] #remove footnotes

aus_marriage$division <- NA #initialize division variable

aus_marriage$division[1:47] <- "New South Wales"
aus_marriage$division[51:87] <- "Victoria"
aus_marriage$division[91:120] <- "Queensland"
aus_marriage$division[124:134] <- "South Australia"
aus_marriage$division[157:161] <- "Tasmania"
aus_marriage$division[165:166] <- "Northern Territory"
aus_marriage$division[170:171] <- "Australian Capital Territory"

aus_marriage <- aus_marriage %>%
  filter(!is.na(division)) %>% #remove any invalid divisions
    rename(x1 = 1, x2 = 2, x3 = 3, x4 = 4, x5 = 5, x6 = 6, x7 = 7, x8 = 8, x9 = 9, x10 = 10, x11 = 11, x12 = 12, x13 = 13, x14 = 14, x15 = 15, x16 = 16) %>%
  mutate(x1 = gsub("\\(c)", "", x1)) %>%
  mutate(x1 = gsub("\\(d)", "", x1)) %>%
  mutate(x1 = gsub("\\(e)", "", x1)) #eliminate footnote letters in district name

head(aus_marriage)

# A tibble: 6 × 17
  x1        x2    x3    x4    x5    x6    x7 x8       x9   x10   x11   x12   x13
  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Banks  37736  44.9 46343  55.1 84079   100 NA    84079  79.9   247   0.2 20928
2 Barton 37153  43.6 47984  56.4 85137   100 NA    85137  77.8   226   0.2 24008
3 Benne… 42943  49.8 43215  50.2 86158   100 NA    86158  81     244   0.2 19973
4 Berow… 48471  54.6 40369  45.4 88840   100 NA    88840  84.5   212   0.2 16038
5 Blaxl… 20406  26.1 57926  73.9 78332   100 NA    78332  75     220   0.2 25883
6 Bradf… 53681  60.6 34927  39.4 88608   100 NA    88608  83.5   202   0.2 17261
# … with 4 more variables: x14 <dbl>, x15 <dbl>, x16 <dbl>, division <chr>

Briefly describe the data

Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”

Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting \(k-3\) variables into a longer format where the \(k-3\) variable names will move into the names_to variable and the current values in each of those columns will move into the values_to variable. Therefore, we would expect \(n * (k-3)\) rows in the pivoted dataframe!

Example: find current and future data dimensions

Lets see if this works with a simple example.

Code

df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
           year = rep(c(1980,1990), 3), 
           trade = rep(c("NAFTA", "NAFTA", "EU"),2),
           outgoing = rnorm(6, mean=1000, sd=500),
           incoming = rlogis(6, location=1000, 
                             scale = 400))
df

# A tibble: 6 × 5
  country  year trade outgoing incoming
  <chr>   <dbl> <chr>    <dbl>    <dbl>
1 Mexico   1980 NAFTA     995.    1062.
2 USA      1990 NAFTA     307.    1021.
3 France   1980 EU       1200.     546.
4 Mexico   1990 NAFTA    1519.     335.
5 USA      1980 NAFTA     780.    1463.
6 France   1990 EU       1323.    1363.

Code

#existing rows/cases
nrow(df)

[1] 6

Code

#existing columns/cases
ncol(df)

[1] 5

Code

#expected rows/cases
nrow(df) * (ncol(df)-3)

[1] 12

Code

# expected columns 
3 + 2

[1] 5

Or simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.

Challenge: Describe the final dimensions

Document your work here.

Implicitly, there are two pivots going on. First, the yes and no counts are pivoted into a single variable. Secondly, the yes and no percentages are pivoted into a single variable. There is only one response type variable (categorizing yes or no) So, there are two variables used to identify a case (division and district) and 4 - 2 = 2 variables to be pivoted. There are 5 columns in the new dataframe, because k = 2 + 2 + 1 (division + district + count + percent + resp). There are 268 columns in the new dataframe.

Code

aus_marriage2 <- aus_marriage %>%
    select(x1, x2, x3, x4, x5, division) %>%
  rename(district = x1, yes1count=x2, yes1percent=x3, no2count=x4, no2percent=x5)

#existing rows/cases
nrow(aus_marriage2)

[1] 134

Code

#existing columns/cases
ncol(aus_marriage2)

[1] 6

Code

#expected rows/cases
nrow(aus_marriage2) * (ncol(aus_marriage2)-2)

[1] 536

Code

# expected columns 
2 + 2 + 1

[1] 5

Any additional comments?

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Example

Code

df<-pivot_longer(df, col = c(outgoing, incoming),
                 names_to="trade_direction",
                 values_to = "trade_value")
df

# A tibble: 12 × 5
   country  year trade trade_direction trade_value
   <chr>   <dbl> <chr> <chr>                 <dbl>
 1 Mexico   1980 NAFTA outgoing               995.
 2 Mexico   1980 NAFTA incoming              1062.
 3 USA      1990 NAFTA outgoing               307.
 4 USA      1990 NAFTA incoming              1021.
 5 France   1980 EU    outgoing              1200.
 6 France   1980 EU    incoming               546.
 7 Mexico   1990 NAFTA outgoing              1519.
 8 Mexico   1990 NAFTA incoming               335.
 9 USA      1980 NAFTA outgoing               780.
10 USA      1980 NAFTA incoming              1463.
11 France   1990 EU    outgoing              1323.
12 France   1990 EU    incoming              1363.

Yes, once it is pivoted long, our resulting data are \(12x5\) - exactly what we expected!

Challenge: Pivot the Chosen Data

Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?

A case correspondents to a single type of response (yes or no) in single district in a division in Australia. This meets the requirements for tidy data, because each column represents a variable, each row corresponds to a single observation, and each cell is a single value. The tidy data set contains 268 rows and 5 columns.

Code

aus_marriage_tidy <- aus_marriage2 %>%
    pivot_longer(-c('district', 'division'), names_to=c("resp", ".value"), names_sep = "\\d")

aus_marriage_tidy

# A tibble: 268 × 5
   district  division        resp  count percent
   <chr>     <chr>           <chr> <dbl>   <dbl>
 1 Banks     New South Wales yes   37736    44.9
 2 Banks     New South Wales no    46343    55.1
 3 Barton    New South Wales yes   37153    43.6
 4 Barton    New South Wales no    47984    56.4
 5 Bennelong New South Wales yes   42943    49.8
 6 Bennelong New South Wales no    43215    50.2
 7 Berowra   New South Wales yes   48471    54.6
 8 Berowra   New South Wales no    40369    45.4
 9 Blaxland  New South Wales yes   20406    26.1
10 Blaxland  New South Wales no    57926    73.9
# … with 258 more rows

Any additional comments?

When one needs to pivot multiple columns into separate value columns for the same case variables, it is helpful to rename the pivoted columns into column names based on a pattern. This enables R to recognize which columns go into which new value columns.