DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 3 Instructions

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in data
    • Briefly describe the data
    • Example: find current and future data dimensions
    • Challenge: Describe the final dimensions
    • Challenge: Pivot the Chosen Data

Challenge 3 Instructions

  • Show All Code
  • Hide All Code

  • View Source
challenge_3
animal_weights
eggs
australian_marriage
usa_households
sce_labor
Author

Kim Darkenwald

Published

October 18, 2022

Code
library(tidyverse)
library(readr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • animal_weights.csv ⭐
  • eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
  • australian_marriage*.xls ⭐⭐⭐
  • USA Households*.xlsx ⭐⭐⭐⭐
  • sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟
Code
animalweights <- read_csv("_data/animal_weight.csv")
animalweights
# A tibble: 9 × 17
  IPCC A…¹ Cattl…² Cattl…³ Buffa…⁴ Swine…⁵ Swine…⁶ Chick…⁷ Chick…⁸ Ducks Turkeys
  <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>
1 Indian …     275     110     295      28      28     0.9     1.8   2.7     6.8
2 Eastern…     550     391     380      50     180     0.9     1.8   2.7     6.8
3 Africa       275     173     380      28      28     0.9     1.8   2.7     6.8
4 Oceania      500     330     380      45     180     0.9     1.8   2.7     6.8
5 Western…     600     420     380      50     198     0.9     1.8   2.7     6.8
6 Latin A…     400     305     380      28      28     0.9     1.8   2.7     6.8
7 Asia         350     391     380      50     180     0.9     1.8   2.7     6.8
8 Middle …     275     173     380      28      28     0.9     1.8   2.7     6.8
9 Norther…     604     389     380      46     198     0.9     1.8   2.7     6.8
# … with 7 more variables: Sheep <dbl>, Goats <dbl>, Horses <dbl>, Asses <dbl>,
#   Mules <dbl>, Camels <dbl>, Llamas <dbl>, and abbreviated variable names
#   ¹​`IPCC Area`, ²​`Cattle - dairy`, ³​`Cattle - non-dairy`, ⁴​Buffaloes,
#   ⁵​`Swine - market`, ⁶​`Swine - breeding`, ⁷​`Chicken - Broilers`,
#   ⁸​`Chicken - Layers`
Code
dim(animalweights)
[1]  9 17
Code
colnames(animalweights)
 [1] "IPCC Area"          "Cattle - dairy"     "Cattle - non-dairy"
 [4] "Buffaloes"          "Swine - market"     "Swine - breeding"  
 [7] "Chicken - Broilers" "Chicken - Layers"   "Ducks"             
[10] "Turkeys"            "Sheep"              "Goats"             
[13] "Horses"             "Asses"              "Mules"             
[16] "Camels"             "Llamas"            

Briefly describe the data

As indicated in our data,for the most part, animals share similar weights around regions of the globe. However, when it comes to buffalo, cattle, and swine, there are distinct differences in weight. Animals of these categories in particular appear to be much larger in weight in the Northern American and European regions while the regions of the Middle East, Africa, and the Indian Subcontinent contain animals of significantly less weight.

Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting \(k-3\) variables into a longer format where the \(k-3\) variable names will move into the names_to variable and the current values in each of those columns will move into the values_to variable. Therefore, we would expect \(n * (k-3)\) rows in the pivoted dataframe!

Example: find current and future data dimensions

Lets see if this works with a simple example.

Code

#existing rows/cases
nrow(df)  = 16

#existing columns/cases
ncol(df) = 9

#expected rows/cases
nrow(df) * (ncol(df)-3) = 17 * (9-3)
= 16 * 8



# expected columns 

144
Error: <text>:10:1: unexpected '='
9: nrow(df) * (ncol(df)-3) = 17 * (9-3)
10: =
    ^

Or simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.

Challenge: Describe the final dimensions

The final dimensions will be 54 rows and 3 columns.

Challenge: Pivot the Chosen Data

Code
animal_weights_simplified <- pivot_longer(animalweights, col = -`IPCC Area`, names_to = "Livestock", values_to = "Weight")
animal_weights_simplified
# A tibble: 144 × 3
   `IPCC Area`         Livestock          Weight
   <chr>               <chr>               <dbl>
 1 Indian Subcontinent Cattle - dairy      275  
 2 Indian Subcontinent Cattle - non-dairy  110  
 3 Indian Subcontinent Buffaloes           295  
 4 Indian Subcontinent Swine - market       28  
 5 Indian Subcontinent Swine - breeding     28  
 6 Indian Subcontinent Chicken - Broilers    0.9
 7 Indian Subcontinent Chicken - Layers      1.8
 8 Indian Subcontinent Ducks                 2.7
 9 Indian Subcontinent Turkeys               6.8
10 Indian Subcontinent Sheep                28  
# … with 134 more rows
Code
dim(animal_weights_simplified)
[1] 144   3

The new case will be “Livestock”.

Source Code
---
title: "Challenge 3 Instructions"
author: "Kim Darkenwald"
desription: "Tidy Data: Pivoting"
date: "10/18/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  - animal_weights
  - eggs
  - australian_marriage
  - usa_households
  - sce_labor
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(readr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to:

1.  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2.  identify what needs to be done to tidy the current data
3.  anticipate the shape of pivoted data
4.  pivot the data into tidy format using `pivot_longer`

## Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

-   animal_weights.csv ⭐
-   eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
-   australian_marriage\*.xls ⭐⭐⭐
-   USA Households\*.xlsx ⭐⭐⭐⭐
-   sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟

```{r}
animalweights <- read_csv("_data/animal_weight.csv")
animalweights

dim(animalweights)

colnames(animalweights)




```

### Briefly describe the data

As indicated in our data,for the most part, animals share similar weights around regions of the globe. However, when it comes to buffalo, cattle, and swine, there are distinct differences in weight.  Animals of these categories in particular appear to be much larger in weight in the Northern American and European regions while the regions of the Middle East, Africa, and the Indian Subcontinent contain animals of significantly less weight. 


Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting $k-3$ variables into a longer format where the $k-3$ variable names will move into the `names_to` variable and the current values in each of those columns will move into the `values_to` variable. Therefore, we would expect $n * (k-3)$ rows in the pivoted dataframe!

### Example: find current and future data dimensions

Lets see if this works with a simple example.

```{r}
#| tbl-cap: Example


#existing rows/cases
nrow(df)  = 16

#existing columns/cases
ncol(df) = 9

#expected rows/cases
nrow(df) * (ncol(df)-3) = 17 * (9-3)
= 16 * 8



# expected columns 

144


```

Or simple example has $n = 6$ rows and $k - 3 = 2$ variables being pivoted, so we expect a new dataframe to have $n * 2 = 12$ rows x $3 + 2 = 5$ columns.

### Challenge: Describe the final dimensions

The final dimensions will be 54 rows and 3 columns.

### Challenge: Pivot the Chosen Data

```{r}
animal_weights_simplified <- pivot_longer(animalweights, col = -`IPCC Area`, names_to = "Livestock", values_to = "Weight")
animal_weights_simplified

```


```{r}
#| tbl-cap: Pivoted Example

dim(animal_weights_simplified)
```
The new case will be "Livestock".