Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Nanci Kopecky
March 6, 2023
Today’s challenge is to:
pivot_longer
Read in one (or more) of the following datasets, using the correct R package and command.
Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”
This data file has nine geographic areas (9 observaitions) defined by the Inter-governmental Panel on Climate Change (IPCC) and the weights of seventeen animals (17 varioables). One idea is to pivot or organize the data to compare weights of animals in different geographic areas.
IPCC.Area Cattle...dairy Cattle...non.dairy Buffaloes
1 Indian Subcontinent 275 110 295
2 Eastern Europe 550 391 380
3 Africa 275 173 380
4 Oceania 500 330 380
5 Western Europe 600 420 380
6 Latin America 400 305 380
Swine...market Swine...breeding Chicken...Broilers Chicken...Layers Ducks
1 28 28 0.9 1.8 2.7
2 50 180 0.9 1.8 2.7
3 28 28 0.9 1.8 2.7
4 45 180 0.9 1.8 2.7
5 50 198 0.9 1.8 2.7
6 28 28 0.9 1.8 2.7
Turkeys Sheep Goats Horses Asses Mules Camels Llamas
1 6.8 28.0 30.0 238 130 130 217 217
2 6.8 48.5 38.5 377 130 130 217 217
3 6.8 28.0 30.0 238 130 130 217 217
4 6.8 48.5 38.5 377 130 130 217 217
5 6.8 48.5 38.5 377 130 130 217 217
6 6.8 28.0 30.0 238 130 130 217 217
[1] 17
[1] 9
The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.
One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.
Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting \(k-3\) variables into a longer format where the \(k-3\) variable names will move into the names_to
variable and the current values in each of those columns will move into the values_to
variable. Therefore, we would expect \(n * (k-3)\) rows in the pivoted dataframe!
Used the %>% pipe feature with the mutate function to add a column.
IPCC.Area Cattle...dairy Cattle...non.dairy
1 Indian Subcontinent 275 110
2 Eastern Europe 550 391
3 Africa 275 173
4 Oceania 500 330
5 Western Europe 600 420
6 Latin America 400 305
7 Asia 350 391
8 Middle east 275 173
9 Northern America 604 389
Cattle...dairy - Cattle...non.dairy Buffaloes Swine...market Swine...breeding
1 165 295 28 28
2 159 380 50 180
3 102 380 28 28
4 170 380 45 180
5 180 380 50 198
6 95 380 28 28
7 -41 380 50 180
8 102 380 28 28
9 215 380 46 198
Chicken...Broilers Chicken...Layers Ducks Turkeys Sheep Goats Horses Asses
1 0.9 1.8 2.7 6.8 28.0 30.0 238 130
2 0.9 1.8 2.7 6.8 48.5 38.5 377 130
3 0.9 1.8 2.7 6.8 28.0 30.0 238 130
4 0.9 1.8 2.7 6.8 48.5 38.5 377 130
5 0.9 1.8 2.7 6.8 48.5 38.5 377 130
6 0.9 1.8 2.7 6.8 28.0 30.0 238 130
7 0.9 1.8 2.7 6.8 48.5 38.5 377 130
8 0.9 1.8 2.7 6.8 28.0 30.0 238 130
9 0.9 1.8 2.7 6.8 48.5 38.5 377 130
Mules Camels Llamas
1 130 217 217
2 130 217 217
3 130 217 217
4 130 217 217
5 130 217 217
6 130 217 217
7 130 217 217
8 130 217 217
9 130 217 217
IPCC.Area Cattle...dairy Cattle...non.dairy Buffaloes
Length:9 Min. :275.0 Min. :110 Min. :295.0
Class :character 1st Qu.:275.0 1st Qu.:173 1st Qu.:380.0
Mode :character Median :400.0 Median :330 Median :380.0
Mean :425.4 Mean :298 Mean :370.6
3rd Qu.:550.0 3rd Qu.:391 3rd Qu.:380.0
Max. :604.0 Max. :420 Max. :380.0
Swine...market Swine...breeding Chicken...Broilers Chicken...Layers
Min. :28.00 Min. : 28.0 Min. :0.9 Min. :1.8
1st Qu.:28.00 1st Qu.: 28.0 1st Qu.:0.9 1st Qu.:1.8
Median :45.00 Median :180.0 Median :0.9 Median :1.8
Mean :39.22 Mean :116.4 Mean :0.9 Mean :1.8
3rd Qu.:50.00 3rd Qu.:180.0 3rd Qu.:0.9 3rd Qu.:1.8
Max. :50.00 Max. :198.0 Max. :0.9 Max. :1.8
Ducks Turkeys Sheep Goats Horses
Min. :2.7 Min. :6.8 Min. :28.00 Min. :30.00 Min. :238.0
1st Qu.:2.7 1st Qu.:6.8 1st Qu.:28.00 1st Qu.:30.00 1st Qu.:238.0
Median :2.7 Median :6.8 Median :48.50 Median :38.50 Median :377.0
Mean :2.7 Mean :6.8 Mean :39.39 Mean :34.72 Mean :315.2
3rd Qu.:2.7 3rd Qu.:6.8 3rd Qu.:48.50 3rd Qu.:38.50 3rd Qu.:377.0
Max. :2.7 Max. :6.8 Max. :48.50 Max. :38.50 Max. :377.0
Asses Mules Camels Llamas
Min. :130 Min. :130 Min. :217 Min. :217
1st Qu.:130 1st Qu.:130 1st Qu.:217 1st Qu.:217
Median :130 Median :130 Median :217 Median :217
Mean :130 Mean :130 Mean :217 Mean :217
3rd Qu.:130 3rd Qu.:130 3rd Qu.:217 3rd Qu.:217
Max. :130 Max. :130 Max. :217 Max. :217
mean_Cattledairy mean_Cattlenondairy sd_Cattledairy sd_Cattlenondairy
1 425.4444 298 140.3853 116.2594
[1] 17
[1] 9
# A tibble: 5 × 3
x y z
<int> <dbl> <dbl>
1 1 2 3
2 2 2 4
3 3 2 5
4 4 2 6
5 5 2 7
# A tibble: 1 × 3
`: (` `: )` `24`
<chr> <chr> <chr>
1 sad happy 24
# A tibble: 9 × 17
IPCC.A…¹ Cattl…² Cattl…³ Buffa…⁴ Swine…⁵ Swine…⁶ Chick…⁷ Chick…⁸ Ducks Turkeys
<chr> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 Indian … 275 110 295 28 28 0.9 1.8 2.7 6.8
2 Eastern… 550 391 380 50 180 0.9 1.8 2.7 6.8
3 Africa 275 173 380 28 28 0.9 1.8 2.7 6.8
4 Oceania 500 330 380 45 180 0.9 1.8 2.7 6.8
5 Western… 600 420 380 50 198 0.9 1.8 2.7 6.8
6 Latin A… 400 305 380 28 28 0.9 1.8 2.7 6.8
7 Asia 350 391 380 50 180 0.9 1.8 2.7 6.8
8 Middle … 275 173 380 28 28 0.9 1.8 2.7 6.8
9 Norther… 604 389 380 46 198 0.9 1.8 2.7 6.8
# … with 7 more variables: Sheep <dbl>, Goats <dbl>, Horses <int>, Asses <int>,
# Mules <int>, Camels <int>, Llamas <int>, and abbreviated variable names
# ¹IPCC.Area, ²Cattle...dairy, ³Cattle...non.dairy, ⁴Buffaloes,
# ⁵Swine...market, ⁶Swine...breeding, ⁷Chicken...Broilers, ⁸Chicken...Layers
The first data frame is accurate, but not sure how useful. The second dataframe was just exploring the structure of the data frames, it is not useful or accurate, but I understand what the number means after the array c.
# A tibble: 9 × 3
IPCC.Area CattleDairy CattleNonDairy
<chr> <dbl> <dbl>
1 Indian Sub 507. 248.
2 E. Europe 369. 465.
3 Africa 520. 202.
4 Oceania 394. 283.
5 W. Europe 320. 470.
6 Latin America 569. 529.
7 Asia 582. 236.
8 Middle East 476. 467.
9 N. America 229. 110.
# A tibble: 18 × 4
CattleType IPCC.Area CattleDairy CattleNonDairy
<chr> <chr> <dbl> <dbl>
1 Dairy Indian Sub 265. 138.
2 Non Dairy E. Europe 466. 329.
3 Dairy Africa 447. 312.
4 Non Dairy Oceania 262. 181.
5 Dairy W. Europe 336. 199.
6 Non Dairy Latin America 319. 294.
7 Dairy Asia 436. 302.
8 Non Dairy Middle East 651. 286.
9 Dairy N. America 36.0 258.
10 Non Dairy Indian Sub 801. 475.
11 Dairy E. Europe 467. 299.
12 Non Dairy Africa 345. 179.
13 Dairy Oceania 252. 454.
14 Non Dairy W. Europe 286. 196.
15 Dairy Latin America 645. 356.
16 Non Dairy Asia 409. 302.
17 Dairy Middle East 399. 112.
18 Non Dairy N. America 119. 85.9
# A tibble: 2 × 3
CattleType IndianSub E.Europe
<chr> <dbl> <dbl>
1 Dairy 361. 512.
2 Non Dairy 300. 280.
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.
Pivot_Longer command
# A tibble: 18 × 3
IPCC.Area Cattle_Type Animal_Weight
<chr> <chr> <dbl>
1 Indian Sub CattleDairy 507.
2 Indian Sub CattleNonDairy 248.
3 E. Europe CattleDairy 369.
4 E. Europe CattleNonDairy 465.
5 Africa CattleDairy 520.
6 Africa CattleNonDairy 202.
7 Oceania CattleDairy 394.
8 Oceania CattleNonDairy 283.
9 W. Europe CattleDairy 320.
10 W. Europe CattleNonDairy 470.
11 Latin America CattleDairy 569.
12 Latin America CattleNonDairy 529.
13 Asia CattleDairy 582.
14 Asia CattleNonDairy 236.
15 Middle East CattleDairy 476.
16 Middle East CattleNonDairy 467.
17 N. America CattleDairy 229.
18 N. America CattleNonDairy 110.
---
title: "Challenge 3"
author: "Nanci Kopecky"
description: "Tidy Data: Pivoting Animal Weights"
date: "3/6/23"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- animal_weights
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2. identify what needs to be done to tidy the current data
3. anticipate the shape of pivoted data
4. pivot the data into tidy format using `pivot_longer`
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- animal_weights.csv ⭐
- eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
- australian_marriage\*.xls ⭐⭐⭐
- USA Households\*.xlsx ⭐⭐⭐⭐
- sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟
```{r}
library(readr)
animal_weight<-read.csv(file = "_data/animal_weight.csv",
header=TRUE,
sep = ","
)
```
### Briefly describe the data
Describe the data, and be sure to comment on why you are planning to pivot it to make it "tidy"
This data file has nine geographic areas (9 observaitions) defined by the Inter-governmental Panel on Climate Change (IPCC) and the weights of seventeen animals (17 varioables). One idea is to pivot or organize the data to compare weights of animals in different geographic areas.
```{r}
head(animal_weight)
view(animal_weight)
ncol(animal_weight)
nrow(animal_weight)
```
## Anticipate the End Result
The first step in pivoting the data is to try to come up with a concrete vision of what the end product *should* look like - that way you will know whether or not your pivoting was successful.
One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.
Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting $k-3$ variables into a longer format where the $k-3$ variable names will move into the `names_to` variable and the current values in each of those columns will move into the `values_to` variable. Therefore, we would expect $n * (k-3)$ rows in the pivoted dataframe!
### Example: find current and future data dimensions
Used the %\>% pipe feature with the mutate function to add a column.
```{r}
#| tbl-cap: Example
animal_weight%>%mutate(Cattle...dairy - Cattle...non.dairy, .after = Cattle...non.dairy)
summary(animal_weight)
animal_weight %>%
summarize(
mean_Cattledairy=mean(Cattle...dairy),
mean_Cattlenondairy=mean(Cattle...non.dairy),
sd_Cattledairy=sd(Cattle...dairy),
sd_Cattlenondairy=sd(Cattle...non.dairy)
)
ncol(animal_weight)
nrow(animal_weight)
```
## Practicing basics of tibble
```{r}
tb<-tibble(x=1:5,y=2,z=x+y)
tb
tb1<-tibble(': (' = "sad", ': )' = "happy",'24'="24")
tb1
as_tibble(animal_weight)
```
## Trying Data Frames
The first data frame is accurate, but not sure how useful. The second dataframe was just exploring the structure of the data frames, it is not useful or accurate, but I understand what the number means after the array c.
```{r}
df1<-tibble(IPCC.Area = rep(c("Indian Sub", "E. Europe", "Africa", "Oceania", "W. Europe", "Latin America", "Asia", "Middle East", "N. America"),1),
CattleDairy = rnorm(9, mean=425.44, sd=140.39),
CattleNonDairy = rnorm(9, mean=298, sd=116.26))
df1
df2<-tibble(CattleType=rep(c("Dairy", "Non Dairy"), 9),
IPCC.Area =rep(c("Indian Sub", "E. Europe", "Africa", "Oceania", "W. Europe", "Latin America", "Asia", "Middle East", "N. America"),2),
CattleDairy = rnorm(18,mean=425.44, sd=140.39),
CattleNonDairy = rnorm(18, mean=298, sd=116.26)
)
df2
```
```{r}
df3<-tibble(CattleType=rep(c("Dairy", "Non Dairy"), 1),
IndianSub=rep(c(rnorm(1,mean=425.44, sd=140.39), rnorm(1, mean=298, sd=116.26))),
E.Europe=rep(c(rnorm(1,mean=425.44, sd=140.39), rnorm(1, mean=298, sd=116.26))))
df3
```
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
### Example
Pivot_Longer command
```{r}
df1<-pivot_longer(df1, col = c(CattleDairy, CattleNonDairy),
names_to = "Cattle_Type",
values_to = "Animal_Weight")
df1
```