challenge_3
animal_weights
Matthew_Weiner
Tidy Data: Pivoting
Author

Matthew Weiner

Published

March 24, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • animal_weights.csv ⭐
  • eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
  • australian_marriage*.xls ⭐⭐⭐
  • USA Households*.xlsx ⭐⭐⭐⭐
  • sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟

Introduction

For this challenge, I chose to investigate the animal_weights dataset.

Getting Started

First, I imported the necessary package to read in the file. Then, to get an idea of what the data contained, I used the head() command to output the first few rows of the dataset and the colnames() command to view all of the column names.

Code
library(readr)
aw <- read_csv("_data/animal_weight.csv")
head(aw)
# A tibble: 6 × 17
  IPCC A…¹ Cattl…² Cattl…³ Buffa…⁴ Swine…⁵ Swine…⁶ Chick…⁷ Chick…⁸ Ducks Turkeys
  <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>
1 Indian …     275     110     295      28      28     0.9     1.8   2.7     6.8
2 Eastern…     550     391     380      50     180     0.9     1.8   2.7     6.8
3 Africa       275     173     380      28      28     0.9     1.8   2.7     6.8
4 Oceania      500     330     380      45     180     0.9     1.8   2.7     6.8
5 Western…     600     420     380      50     198     0.9     1.8   2.7     6.8
6 Latin A…     400     305     380      28      28     0.9     1.8   2.7     6.8
# … with 7 more variables: Sheep <dbl>, Goats <dbl>, Horses <dbl>, Asses <dbl>,
#   Mules <dbl>, Camels <dbl>, Llamas <dbl>, and abbreviated variable names
#   ¹​`IPCC Area`, ²​`Cattle - dairy`, ³​`Cattle - non-dairy`, ⁴​Buffaloes,
#   ⁵​`Swine - market`, ⁶​`Swine - breeding`, ⁷​`Chicken - Broilers`,
#   ⁸​`Chicken - Layers`
Code
colnames(aw)
 [1] "IPCC Area"          "Cattle - dairy"     "Cattle - non-dairy"
 [4] "Buffaloes"          "Swine - market"     "Swine - breeding"  
 [7] "Chicken - Broilers" "Chicken - Layers"   "Ducks"             
[10] "Turkeys"            "Sheep"              "Goats"             
[13] "Horses"             "Asses"              "Mules"             
[16] "Camels"             "Llamas"            

This preliminary investigation has already given me a good idea of the type of data involved. The title of the dataset, together with the fact that every column except the first is of type dbl, tells me that this dataset describes the average weight of each type of animal in each IPCC area.

After viewing the column names, I could see that we will want to use the pivot_longer() command to tidy the data. The original dataset is in a wide format: each animal type has its own column, so the weights are spread across 16 separate columns. This makes many data analysis and visualization tasks awkward, because the animal type is encoded in the column names rather than stored as a variable in its own right.

Using pivot_longer(), we can convert the dataset to a long format in which each row is a single combination of IPCC Area and Animal, with a Value column holding the average weight for that combination.
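To make this transformation concrete before working out the dimensions, here is a minimal sketch on a made-up two-row, two-animal tibble (the values are invented purely for illustration, and the tibble() call assumes the tidyverse is loaded as above):

Code
# toy wide-format tibble: one id column and two animal columns
toy <- tibble(
  `IPCC Area` = c("Area A", "Area B"),
  Ducks = c(2.7, 2.7),
  Sheep = c(28, 45)
)

# pivot both animal columns into Animal/Value pairs:
# 2 rows x 2 animal columns -> 4 rows and 3 columns
pivot_longer(toy, cols = -`IPCC Area`, names_to = "Animal", values_to = "Value")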

Pivoting Math

The original shape of the dataset is (9, 17): 9 rows and 17 columns. Our goal is to pivot all of the animal columns into a single column, which means every column except IPCC Area will be pivoted. If we let k = 17 be the number of columns, then we pivot k - 1 = 16 columns. Each of the 9 original rows will expand into 16 rows (one per animal), so we expect 9 × 16 = 144 rows, and the pivoted data will have 3 columns: the IPCC Area identifier plus the new Animal and Value columns.

Code
#existing rows
nrow(aw)
[1] 9
Code
#existing cols
ncol(aw)
[1] 17
Code
#expected rows: each existing row expands into one row per pivoted column
nrow(aw) * (ncol(aw)-1)
[1] 144
Code
#expected cols: 1 id column (IPCC Area) + 2 new columns (Animal, Value)
1+2
[1] 3

As computed above, the resulting shape of the pivoted dataset should be (144, 3): 144 rows and 3 columns.

Pivot the Data

Code
pivot_aw <- pivot_longer(aw,cols = -`IPCC Area`, names_to = "Animal", values_to = "Value")

pivot_aw
# A tibble: 144 × 3
   `IPCC Area`         Animal             Value
   <chr>               <chr>              <dbl>
 1 Indian Subcontinent Cattle - dairy     275  
 2 Indian Subcontinent Cattle - non-dairy 110  
 3 Indian Subcontinent Buffaloes          295  
 4 Indian Subcontinent Swine - market      28  
 5 Indian Subcontinent Swine - breeding    28  
 6 Indian Subcontinent Chicken - Broilers   0.9
 7 Indian Subcontinent Chicken - Layers     1.8
 8 Indian Subcontinent Ducks                2.7
 9 Indian Subcontinent Turkeys              6.8
10 Indian Subcontinent Sheep               28  
# … with 134 more rows
Code
nrow(pivot_aw)
[1] 144
Code
ncol(pivot_aw)
[1] 3

After applying the pivot_longer() command, we can see that our prediction was correct: the new dataset has 144 rows and 3 columns. The data is now tidy and much more readable, and it can more easily be used for future investigative tasks.
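As a quick example of the kind of investigative task the long format makes easy (a sketch that assumes the pivot_aw tibble created above), we can compute the average weight of each animal across all IPCC areas in a single pipeline:

Code
# mean weight of each animal across all IPCC areas, heaviest first
pivot_aw %>%
  group_by(Animal) %>%
  summarize(mean_weight = mean(Value)) %>%
  arrange(desc(mean_weight))

The same summary would be much more awkward on the original wide data, where each animal lives in its own column.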