Challenge 3 Submission

challenge_3
animal_weights
Tidy Data: Pivoting
Author

Suyash Bhagwat

Published

June 7, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • animal_weights.csv ⭐
Code
data_animals <- read_csv("_data/animal_weight.csv")
data_animals

Briefly describe the data

Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”.

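Ans: From the table above, the data lists the weights of different farm (domesticated) animals for each IPCC Area. Each animal type has its own column, so the animal type is stored in the column names rather than as a variable. Pivoting longer moves the animal type into a single column and the weights into another.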

Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting \(k-3\) variables into a longer format where the \(k-3\) variable names will move into the names_to variable and the current values in each of those columns will move into the values_to variable. Therefore, we would expect \(n * (k-3)\) rows in the pivoted dataframe!

Example: find current and future data dimensions

Let's see if this works with a simple example.

Code
df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
           year = rep(c(1980,1990), 3), 
           trade = rep(c("NAFTA", "NAFTA", "EU"),2),
           outgoing = rnorm(6, mean=1000, sd=500),
           incoming = rlogis(6, location=1000, 
                             scale = 400))
df

Example

Code
#existing rows/cases
nrow(df)
[1] 6
Code
#existing columns/variables
ncol(df)
[1] 5
Code
#expected rows/cases
nrow(df) * (ncol(df)-3)
[1] 12
Code
# expected columns 
3 + 2
[1] 5

Our simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect the new dataframe to have \(n * 2 = 12\) rows and \(3 + 2 = 5\) columns.
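The same arithmetic can be wrapped in a small helper. This is only a sketch: `expected_long_dims` is a hypothetical name (not a tidyverse function), and it assumes the pivot creates exactly one names_to column and one values_to column.

```r
# Hypothetical helper: expected dimensions after pivot_longer(),
# assuming one names_to column and one values_to column.
expected_long_dims <- function(n_rows, n_cols, n_id_cols) {
  n_pivoted <- n_cols - n_id_cols          # columns that get stacked
  c(rows = n_rows * n_pivoted,             # one row per case per pivoted column
    cols = n_id_cols + 2)                  # id columns + names_to + values_to
}

# The simple example: 6 rows, 5 columns, 3 identifying columns
expected_long_dims(n_rows = 6, n_cols = 5, n_id_cols = 3)
```

For the simple example this returns 12 rows and 5 columns, matching the manual calculation.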

Challenge: Describe the final dimensions

Document your work here.

Ans: The original data_animals tibble contains 9 rows and 17 columns. Of the 17 columns, only 1 (IPCC Area) identifies each case, so we keep that column fixed and pivot the other 16. The expected number of rows and columns in the pivoted table is calculated in the code below:

Code
#existing rows
print("Existing rows:")
[1] "Existing rows:"
Code
nrow(data_animals)
[1] 9
Code
#existing columns/variables
print("Existing columns:")
[1] "Existing columns:"
Code
ncol(data_animals)
[1] 17
Code
#expected rows/cases
print("Expected rows in the new pivoted table:")
[1] "Expected rows in the new pivoted table:"
Code
nrow(data_animals) * (ncol(data_animals)-1)
[1] 144
Code
# expected columns
print("Expected columns in the new pivoted table:")
[1] "Expected columns in the new pivoted table:"
Code
1+2
[1] 3

Any additional comments?

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Example

Code
df<-pivot_longer(df, cols = c(outgoing, incoming),
                 names_to="trade_direction",
                 values_to = "trade_value")
df

Pivoted Example

Yes, once it is pivoted long, our resulting data are \(12 \times 5\) - exactly what we expected!
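The sanity check can also be made programmatic rather than read off the printed table. The sketch below rebuilds the example under the hypothetical names `wide` and `long` (chosen here so that `df` is not overwritten) and stops with an error if the dimensions differ from the expected ones.

```r
library(tibble)
library(tidyr)

# Same structure as the example tibble; the values are random draws
wide <- tibble(
  country  = rep(c("Mexico", "USA", "France"), 2),
  year     = rep(c(1980, 1990), 3),
  trade    = rep(c("NAFTA", "NAFTA", "EU"), 2),
  outgoing = rnorm(6, mean = 1000, sd = 500),
  incoming = rlogis(6, location = 1000, scale = 400)
)

# Expected dimensions: n * (k - 3) rows, 3 id columns + 2 new columns
expected <- c(nrow(wide) * (ncol(wide) - 3), 3 + 2)

long <- pivot_longer(wide,
                     cols = c(outgoing, incoming),
                     names_to = "trade_direction",
                     values_to = "trade_value")

# Errors if the pivoted dimensions do not match the calculation
stopifnot(all(dim(long) == expected))
```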

Challenge: Pivot the Chosen Data

Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?

Ans: After pivoting, a case (observation) is one type of farm animal in one IPCC Area. Each case has three columns: the IPCC Area, the type of farm animal (farm_animal), and the weight of that animal (weight).

Tidy data, as described by Hadley Wickham, the creator of the tidyverse, has three requirements: each variable has its own column (e.g. farm_animal), each observation has its own row, and each value has its own cell. In the pivoted table, every row holds the IPCC Area, the animal type, and the weight for a single observation, so these requirements are met.

Code
data_pivoted <- pivot_longer(
  data_animals,
  cols = c(`Cattle - dairy`:Llamas),
  names_to = "farm_animal",
  values_to = "weight"
)
data_pivoted
Code
#actual rows/cases
print("Actual rows in the new pivoted table:")
[1] "Actual rows in the new pivoted table:"
Code
nrow(data_pivoted)
[1] 144
Code
# actual columns
print("Actual columns in the new pivoted table:")
[1] "Actual columns in the new pivoted table:"
Code
ncol(data_pivoted)
[1] 3

Any additional comments?

Ans: As seen in the above code, the actual number of rows and columns (144 x 3) in the pivoted table match the expected number of rows and columns (144 x 3). Hence we have successfully pivoted the data_animals tibble.
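A further check, beyond matching dimensions, is that pivoting back to wide format recovers the original table. The sketch below uses a small made-up tibble, since the values are illustrative only and are not taken from animal_weight.csv; the names `toy_wide`, `toy_long`, and `toy_back` are hypothetical.

```r
library(tibble)
library(tidyr)

# Made-up values for illustration only; not taken from animal_weight.csv
toy_wide <- tibble(
  area   = c("Region A", "Region B"),
  Cattle = c(500, 300),
  Sheep  = c(50, 30)
)

# Wide -> long: one row per area and animal type
toy_long <- pivot_longer(toy_wide,
                         cols = Cattle:Sheep,
                         names_to = "farm_animal",
                         values_to = "weight")

# Long -> wide: should reproduce the original layout
toy_back <- pivot_wider(toy_long,
                        names_from = farm_animal,
                        values_from = weight)

# 2 rows * 2 pivoted columns = 4 rows; 1 id column + 2 new columns = 3 columns
stopifnot(all(dim(toy_long) == c(4, 3)))
stopifnot(identical(names(toy_back), names(toy_wide)))
stopifnot(all(toy_back == toy_wide))
```

The same round trip could be applied to data_pivoted with names_from = farm_animal and values_from = weight.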