Challenge 3 Instructions

challenge_3
Pivot
Author

Nayan Jani

Published

August 17, 2022

Code
library(tidyverse)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer
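The last step relies on tidyr's pivot_longer(). As a minimal sketch of how the call works (using a toy tibble, not one of the challenge datasets):

```r
library(tidyverse)

# Toy wide table: one row per id, one measurement per column
wide <- tibble(id = 1:2, col_a = c(10, 20), col_b = c(30, 40))

# pivot_longer() stacks col_a and col_b: their names go into `name`
# and their values into `value`, so 2 rows become 2 x 2 = 4 rows
long <- pivot_longer(wide,
                     cols = c(col_a, col_b),
                     names_to = "name",
                     values_to = "value")
long  # 4 rows x 3 columns (id, name, value)
```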

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • animal_weights.csv ⭐
  • eggs_tidy.csv ⭐⭐ or organicpoultry.xls ⭐⭐⭐
  • australian_marriage*.xlsx ⭐⭐⭐
  • USA Households*.xlsx ⭐⭐⭐⭐
  • sce_labor_chart_data_public.csv 🌟🌟🌟🌟🌟
Code
eggs <-read_csv("_data/eggs_tidy.csv",
                        show_col_types = FALSE)
eggs
# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen extra_l…¹
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>     <dbl>
 1 January    2004             126         230                    132       230 
 2 February   2004             128.        226.                   134.      230 
 3 March      2004             131         225                    137       230 
 4 April      2004             131         225                    137       234.
 5 May        2004             131         225                    137       236 
 6 June       2004             134.        231.                   137       241 
 7 July       2004             134.        234.                   137       241 
 8 August     2004             134.        234.                   137       241 
 9 September  2004             130.        234.                   136.      241 
10 October    2004             128.        234.                   136.      241 
# … with 110 more rows, and abbreviated variable name ¹​extra_large_dozen
# ℹ Use `print(n = ...)` to see more rows
Code
print(dfSummary(eggs, varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

eggs

Dimensions: 120 x 6
Duplicates: 0
month [character]
  Freqs: each of the 12 months appears 10 times (8.3% each)
  Missing: 0 (0.0%)

year [numeric]
  Mean (sd): 2008.5 (2.9); min ≤ med ≤ max: 2004 ≤ 2008.5 ≤ 2013; IQR (CV): 5 (0)
  Freqs: each year 2004–2013 appears 12 times (10.0% each)
  Missing: 0 (0.0%)

large_half_dozen [numeric]
  Mean (sd): 155.2 (22.6); min ≤ med ≤ max: 126 ≤ 174.5 ≤ 178; IQR (CV): 45.1 (0.1)
  Freqs: 126.00: 1 (0.8%), 128.50: 29 (24.2%), 129.75: 1 (0.8%), 131.00: 3 (2.5%), 131.12!: 1 (0.8%), 132.00: 15 (12.5%), 133.50: 3 (2.5%), 173.25: 6 (5.0%), 174.50: 47 (39.2%), 178.00: 14 (11.7%)  (! = rounded)
  Missing: 0 (0.0%)

large_dozen [numeric]
  Mean (sd): 254.2 (18.5); min ≤ med ≤ max: 225 ≤ 267.5 ≤ 277.5; IQR (CV): 34.5 (0.1)
  Freqs: 12 distinct values
  Missing: 0 (0.0%)

extra_large_half_dozen [numeric]
  Mean (sd): 164.2 (24.7); min ≤ med ≤ max: 132 ≤ 185.5 ≤ 188.1; IQR (CV): 49.7 (0.2)
  Freqs: 132.00: 1 (0.8%), 134.50: 1 (0.8%), 135.50: 28 (23.3%), 135.88!: 1 (0.8%), 137.00: 6 (5.0%), 138.12!: 1 (0.8%), 139.00: 15 (12.5%), 185.50: 53 (44.2%), 188.13: 14 (11.7%)  (! = rounded)
  Missing: 0 (0.0%)

extra_large_dozen [numeric]
  Mean (sd): 266.8 (22.8); min ≤ med ≤ max: 230 ≤ 285.5 ≤ 290; IQR (CV): 44 (0.1)
  Freqs: 11 distinct values
  Missing: 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-28

Briefly describe the data

This dataset comprises 120 rows and 6 variables. One variable is a character (month) and the rest are doubles. Based on the summary statistics of the egg variables, the variance of sales for each egg type is high, meaning that most cases are not near their average.

Anticipate the End Result

The column names large_half_dozen, large_dozen, extra_large_half_dozen and extra_large_dozen represent values of the type variable, the values in the columns represent values of the sales variable, and each row represents multiple observations, not one.
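Before pivoting, it is worth confirming which columns hold those measurements. One way to grab them programmatically (relying on the fact that all four names contain "dozen") is the tidyselect helper contains():

```r
# All four price columns share the substring "dozen",
# so they can be selected without typing each name
eggs %>%
  select(contains("dozen")) %>%
  names()
# the four *_dozen column names, in their original order
```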

Example: find current and future data dimensions

Let's see if this works with a simple example.

Code
df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
           year = rep(c(1980,1990), 3), 
           trade = rep(c("NAFTA", "NAFTA", "EU"),2),
           outgoing = rnorm(6, mean=1000, sd=500),
           incoming = rlogis(6, location=1000, 
                             scale = 400))
df
# A tibble: 6 × 5
  country  year trade outgoing incoming
  <chr>   <dbl> <chr>    <dbl>    <dbl>
1 Mexico   1980 NAFTA    1256.     369.
2 USA      1990 NAFTA     998.     934.
3 France   1980 EU       2122.    -678.
4 Mexico   1990 NAFTA    2074.    1980.
5 USA      1980 NAFTA    1136.     602.
6 France   1990 EU       1423.    1876.
Code
#existing rows/cases
nrow(df)
[1] 6
Code
#existing columns/variables
ncol(df)
[1] 5
Code
#expected rows/cases
nrow(df) * (ncol(df)-3)
[1] 12
Code
# expected columns 
3 + 2
[1] 5

Our simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect the new data frame to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.

Challenge: Describe the final dimensions

Document your work here.

Code
nrow(eggs)
[1] 120
Code
ncol(eggs)
[1] 6

What needs to change is that the column names that encode the egg types need to become values of a new column called type, and the corresponding sales figure for each type will be stored in a column named sales.
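Mirroring the arithmetic from the trade example (two identifier columns kept, four price columns pivoted), the expected dimensions can be computed up front:

```r
# expected rows: each of the 120 month/year cases expands to 4 type rows
nrow(eggs) * (ncol(eggs) - 2)
# [1] 480

# expected columns: 2 identifiers (month, year) + 2 new (type, sales)
2 + 2
# [1] 4
```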

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Example

Code
df<-pivot_longer(df, col = c(outgoing, incoming),
                 names_to="trade_direction",
                 values_to = "trade_value")
df
# A tibble: 12 × 5
   country  year trade trade_direction trade_value
   <chr>   <dbl> <chr> <chr>                 <dbl>
 1 Mexico   1980 NAFTA outgoing              1256.
 2 Mexico   1980 NAFTA incoming               369.
 3 USA      1990 NAFTA outgoing               998.
 4 USA      1990 NAFTA incoming               934.
 5 France   1980 EU    outgoing              2122.
 6 France   1980 EU    incoming              -678.
 7 Mexico   1990 NAFTA outgoing              2074.
 8 Mexico   1990 NAFTA incoming              1980.
 9 USA      1980 NAFTA outgoing              1136.
10 USA      1980 NAFTA incoming               602.
11 France   1990 EU    outgoing              1423.
12 France   1990 EU    incoming              1876.

Yes, once it is pivoted long, our resulting data are \(12x5\) - exactly what we expected!

Challenge: Pivot the Chosen Data

Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?

Code
eggs <- pivot_longer(eggs, col = c(large_half_dozen, large_dozen, extra_large_half_dozen, extra_large_dozen),
                 names_to="type",
                 values_to = "sales")
eggs
# A tibble: 480 × 4
   month     year type                   sales
   <chr>    <dbl> <chr>                  <dbl>
 1 January   2004 large_half_dozen        126 
 2 January   2004 large_dozen             230 
 3 January   2004 extra_large_half_dozen  132 
 4 January   2004 extra_large_dozen       230 
 5 February  2004 large_half_dozen        128.
 6 February  2004 large_dozen             226.
 7 February  2004 extra_large_half_dozen  134.
 8 February  2004 extra_large_dozen       230 
 9 March     2004 large_half_dozen        131 
10 March     2004 large_dozen             225 
# … with 470 more rows
# ℹ Use `print(n = ...)` to see more rows

Now this dataset is in tidy form because each row is one observation. I used pivot_longer because we needed more rows so that each observation could stand alone in the dataset. When you read the dataset, you can see that each case has only one value attached to it.
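As an extra sanity check (not part of the original steps), the pivot can be reversed with pivot_wider(); recovering the original 120 x 6 shape confirms that no information was lost:

```r
# Undo the pivot: spread `type` back into columns filled from `sales`
wide_again <- pivot_wider(eggs,
                          names_from = type,
                          values_from = sales)
dim(wide_again)
# [1] 120   6
```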