Challenge 3

challenge_3

animal_weights

eggs

australian_marriage

usa_households

sce_labor

Tidy Data: Pivoting

Author

Shuqi Hong

Published

June 7, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

eggs <- read.csv("_data/eggs_tidy.csv")

head(eggs)

     month year large_half_dozen large_dozen extra_large_half_dozen
1  January 2004            126.0     230.000                  132.0
2 February 2004            128.5     226.250                  134.5
3    March 2004            131.0     225.000                  137.0
4    April 2004            131.0     225.000                  137.0
5      May 2004            131.0     225.000                  137.0
6     June 2004            133.5     231.375                  137.0
  extra_large_dozen
1             230.0
2             230.0
3             230.0
4             234.5
5             236.0
6             241.0

From the original file, it looks like the file is about the number of purchase of different sizes eggs in each year and months.

We can look the trend of the number in years. So first we drop the month colunm and sum all the number by years. The tibble is wide and not clear to see the year trend, so we pivot it.

Code

yeartrend <- eggs %>% select(-month) %>%group_by(year) %>%
  summarise_all(sum) %>%
  t()
yeartrend

                           [,1] [,2]     [,3]     [,4]   [,5] [,6]   [,7]
year                   2004.000 2005 2006.000 2007.000 2008.0 2009 2010.0
large_half_dozen       1563.250 1542 1542.000 1579.625 1881.5 2094 2094.0
large_dozen            2763.625 2802 2802.000 2839.625 3127.5 3300 3221.5
extra_large_half_dozen 1630.875 1626 1626.000 1663.625 1993.5 2226 2226.0
extra_large_dozen      2847.500 2892 2896.875 2935.625 3223.5 3426 3426.0
                         [,8]    [,9]   [,10]
year                   2011.0 2012.00 2013.00
large_half_dozen       2094.0 2093.50 2136.00
large_dozen            3227.5 3210.00 3210.00
extra_large_half_dozen 2226.0 2231.26 2257.56
extra_large_dozen      3426.0 3462.00 3480.00

Code

eggs %>% select(-month) %>%group_by(year) %>%
  pivot_wider(names_from = year, values_from = c(large_half_dozen,large_dozen))

# A tibble: 15 × 22
   extra_large_half_dozen extra_large_dozen large_half_dozen_2004
                    <dbl>             <dbl> <list>               
 1                   132               230  <dbl [1]>            
 2                   134.              230  <dbl [1]>            
 3                   137               230  <dbl [1]>            
 4                   137               234. <dbl [1]>            
 5                   137               236  <dbl [1]>            
 6                   137               241  <dbl [3]>            
 7                   136.              241  <dbl [1]>            
 8                   136.              241  <dbl [3]>            
 9                   136.              241. <NULL>               
10                   136.              242. <NULL>               
11                   138.              244. <NULL>               
12                   139               245  <NULL>               
13                   186.              286. <NULL>               
14                   186.              288. <NULL>               
15                   188.              290  <NULL>               
# ℹ 19 more variables: large_half_dozen_2005 <list>,
#   large_half_dozen_2006 <list>, large_half_dozen_2007 <list>,
#   large_half_dozen_2008 <list>, large_half_dozen_2009 <list>,
#   large_half_dozen_2010 <list>, large_half_dozen_2011 <list>,
#   large_half_dozen_2012 <list>, large_half_dozen_2013 <list>,
#   large_dozen_2004 <list>, large_dozen_2005 <list>, large_dozen_2006 <list>,
#   large_dozen_2007 <list>, large_dozen_2008 <list>, …

I expect that the colunm is each year and row is each dozen, which is like the dataframe after using t(). The pivot function seems like unvailable in this situation.

Code

eggs %>% group_by(year)  %>% pivot_longer(`large_half_dozen`: `extra_large_dozen`, names_to = "dozen", values_to = "number" ) %>% group_by(dozen,year) %>% 
  summarise( mean = mean(`number`)) %>% 
  pivot_wider(names_from = year, values_from = mean)

# A tibble: 4 × 11
# Groups:   dozen [4]
  dozen    `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013`
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 extra_l…   237.   241    241.   245.   269.   286.   286.   286.   288.   290 
2 extra_l…   136.   136.   136.   139.   166.   186.   186.   186.   186.   188.
3 large_d…   230.   234.   234.   237.   261.   275    268.   269.   268.   268.
4 large_h…   130.   128.   128.   132.   157.   174.   174.   174.   174.   178

From this tibble, we can see the year trend of each dozen much clearer. We can also know people bought more extra large dozen and large dozen than other two scales in general.

Code

mean_January <- eggs %>% filter(month == "January") %>% 
  pivot_longer(`large_half_dozen`: `extra_large_dozen`, names_to = "January_dozen", values_to = "number" ) %>%
  group_by(January_dozen,year) %>%
  summarise( January_mean = mean(`number`)) %>% 
  pivot_wider(names_from = year, values_from = January_mean)
  
mean_January

# A tibble: 4 × 11
# Groups:   January_dozen [4]
  January_dozen   `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012`
  <chr>            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 extra_large_do…    230   241    241    242.    245   286.   286.   286.   286.
2 extra_large_ha…    132   136.   136.   136.    139   186.   186.   186.   186.
3 large_dozen        230   234.   234.   234.    237   278.   272.   268.   268.
4 large_half_doz…    126   128.   128.   128.    132   174.   174.   174.   174.
# ℹ 1 more variable: `2013` <dbl>

We can also choose a specific month like January to see. The average value of extra_large_half_dozen in January from 2009 to 2012 didn’t change which means it was quite steady. All scales of dozen had a large increase between 2008 and 2009. It should happen something in January 2009. Even though the general trend is increasing, the large_dozen has a little decrease from 2009.

Example: find current and future data dimensions

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting \(k-3\) variables into a longer format where the \(k-3\) variable names will move into the names_to variable and the current values in each of those columns will move into the values_to variable. Therefore, we would expect \(n * (k-3)\) rows in the pivoted dataframe!

Lets see if this works with a simple example.

Code

df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
           year = rep(c(1980,1990), 3), 
           trade = rep(c("NAFTA", "NAFTA", "EU"),2),
           outgoing = rnorm(6, mean=1000, sd=500),
           incoming = rlogis(6, location=1000, 
                             scale = 400))
df

# A tibble: 6 × 5
  country  year trade outgoing incoming
  <chr>   <dbl> <chr>    <dbl>    <dbl>
1 Mexico   1980 NAFTA     83.8     679.
2 USA      1990 NAFTA   2184.      446.
3 France   1980 EU       777.     1275.
4 Mexico   1990 NAFTA   1325.     1354.
5 USA      1980 NAFTA   1026.     1063.
6 France   1990 EU      1217.     1813.

Code

#existing rows/cases
nrow(df)

[1] 6

Code

#existing columns/cases
ncol(df)

[1] 5

Code

#expected rows/cases
nrow(df) * (ncol(df)-3)

[1] 12

Code

# expected columns 
3 + 2

[1] 5

Or simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Example

Code

df<-pivot_longer(df, col = c(outgoing, incoming),
                 names_to="trade_direction",
                 values_to = "trade_value")
df

# A tibble: 12 × 5
   country  year trade trade_direction trade_value
   <chr>   <dbl> <chr> <chr>                 <dbl>
 1 Mexico   1980 NAFTA outgoing               83.8
 2 Mexico   1980 NAFTA incoming              679. 
 3 USA      1990 NAFTA outgoing             2184. 
 4 USA      1990 NAFTA incoming              446. 
 5 France   1980 EU    outgoing              777. 
 6 France   1980 EU    incoming             1275. 
 7 Mexico   1990 NAFTA outgoing             1325. 
 8 Mexico   1990 NAFTA incoming             1354. 
 9 USA      1980 NAFTA outgoing             1026. 
10 USA      1980 NAFTA incoming             1063. 
11 France   1990 EU    outgoing             1217. 
12 France   1990 EU    incoming             1813.

Yes, once it is pivoted long, our resulting data are \(12x5\) - exactly what we expected!