DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 3 Guanhua Tan

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in data
    • Briefly describe the data
  • Anticipate the End Result
    • Challenge: Describe the final dimensions
    • Challenge: Pivot the Chosen Data

Challenge 3 Guanhua Tan

  • Show All Code
  • Hide All Code

  • View Source
challenge_3
eggs
Author

Guanhua Tan

Published

September 24, 2022

Code
library(tidyverse)
library(readxl) 

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • animal_weights.csv ⭐
  • eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
  • australian_marriage*.xls ⭐⭐⭐
  • USA Households*.xlsx ⭐⭐⭐⭐
  • sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟
Code
organic_egg<-read_excel("_data/organiceggpoultry.xls")
view(organic_egg)
nrow(organic_egg)
[1] 124
Code
ncol(organic_egg)
[1] 11
Code
nrow(organic_egg)*ncol(organic_egg)
[1] 1364

Briefly describe the data

Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”

This dataset has two different parts of information with the same timeline covering from 2004 to 2012. One part is about the prices of one dozen and 1/2 dozen of large eggs and extra large eggs. The other part is about the prices of “whole”, “B/S Breast”, “Bone-in Breast”, “Whole Legs”, “Thighs”–different parts of a Chicken. In order to clean, I would like to divide it into two datesets. One just includes the information about the eggs and the other the data about the Chicken. Additionally, I’d like to separate “Year” and “Month” to make very cell have its own value.

Anticipate the End Result

Challenge: Describe the final dimensions

I’ll divide the existing dataset into two distinct dataset. Existing data has 124 rows, 11 columns and 1364 cases. Expected data 1 will have 118 rows, 6 columns and 708 cases. Expected data 2 will have 118 rows, 7 columns and 826 cases.

Code
#create a new dataset about the eggs
column_names <-c("Date", "Extra_Large 12","Extra_Large 6","Large 12", "Large 6")
eggs1  <- organic_egg[5:122, 1:5]
colnames(eggs1) =column_names
eggs1
# A tibble: 118 × 5
   Date      `Extra_Large 12` `Extra_Large 6` `Large 12` `Large 6`
   <chr>     <chr>            <chr>           <chr>      <chr>    
 1 Jan 2004  230              132             230        126      
 2 February  230              134.5           226.25     128.5    
 3 March     230              137             225        131      
 4 April     234.5            137             225        131      
 5 May       236              137             225        131      
 6 June      241              137             231.375    133.5    
 7 July      241              137             233.5      133.5    
 8 August    241              137             233.5      133.5    
 9 September 241              135.875         233.5      129.75   
10 October   241              135.5           233.5      128.5    
# … with 108 more rows
Code
# separate year and month and remove "/1"
eggs2 <-eggs1  %>%
  mutate(Date = str_remove(Date, " /1")) %>%
  separate(Date, into=c("Month", "Year"), sep=" ") %>%
  fill(Year) %>%
  select(Year, Month, `Extra_Large 12`, `Extra_Large 6`, `Large 12`, `Large 6`)
eggs2
# A tibble: 118 × 6
   Year  Month     `Extra_Large 12` `Extra_Large 6` `Large 12` `Large 6`
   <chr> <chr>     <chr>            <chr>           <chr>      <chr>    
 1 2004  Jan       230              132             230        126      
 2 2004  February  230              134.5           226.25     128.5    
 3 2004  March     230              137             225        131      
 4 2004  April     234.5            137             225        131      
 5 2004  May       236              137             225        131      
 6 2004  June      241              137             231.375    133.5    
 7 2004  July      241              137             233.5      133.5    
 8 2004  August    241              137             233.5      133.5    
 9 2004  September 241              135.875         233.5      129.75   
10 2004  October   241              135.5           233.5      128.5    
# … with 108 more rows

Challenge: Pivot the Chosen Data

Through the function pivot_longer, I reorganized the dataset. I created three new columns–categorizes, quantiy, and prices. I separated original column names with the whitespace. I moved the type description under the column name “categories” and the numbers (12 or 6) under the column name “quantity.” All values were moved to columns “prices.” New column names reflect the characteristics of all values that makes the dataset clean and clear.

Code
# pivot_longer
eggs2 %>%
  pivot_longer(cols = contains("Large"), 
               names_to = c("categroies", "quanity"),
               names_sep=" ",
               values_to= "prices")
# A tibble: 472 × 5
   Year  Month    categroies  quanity prices
   <chr> <chr>    <chr>       <chr>   <chr> 
 1 2004  Jan      Extra_Large 12      230   
 2 2004  Jan      Extra_Large 6       132   
 3 2004  Jan      Large       12      230   
 4 2004  Jan      Large       6       126   
 5 2004  February Extra_Large 12      230   
 6 2004  February Extra_Large 6       134.5 
 7 2004  February Large       12      226.25
 8 2004  February Large       6       128.5 
 9 2004  March    Extra_Large 12      230   
10 2004  March    Extra_Large 6       137   
# … with 462 more rows
Code
nrow(eggs2)
[1] 118
Code
ncol(eggs2)
[1] 6
Code
nrow(eggs2)*ncol(eggs2)
[1] 708
Code
# create a new dataset about the Chicken and tidy data
organic_egg<-read_excel("_data/organiceggpoultry.xls")
column_names <-c("Date", "Extra_Large 12","Extra_Large 6","Large 12", "Large 6", "NA", "Whole", "B/S Breast",
"Bone-in Breast", "Whole Legs", "Thighs")
Chicken<-organic_egg[5:122, 1:11]
colnames(Chicken) = column_names
Chicken1 <- Chicken %>%
  select(`Date`, `Whole`, `B/S Breast`, `Bone-in Breast`, `Whole Legs`, `Thighs`)
Chicken1
# A tibble: 118 × 6
   Date      Whole   `B/S Breast` `Bone-in Breast` `Whole Legs` Thighs 
   <chr>     <chr>   <chr>        <chr>            <chr>        <chr>  
 1 Jan 2004  197.5   645.5        too few          193.5        too few
 2 February  197.5   642.5        too few          193.5        203    
 3 March     209     642.5        too few          193.5        203    
 4 April     212     642.5        too few          193.5        203    
 5 May       214.5   642.5        too few          193.5        203    
 6 June      216.375 641          too few          201.875      200.375
 7 July      217     642.5        390.5            203.5        199.5  
 8 August    217     642.5        390.5            203.5        199.5  
 9 September 217     642.5        390.5            203.5        199.5  
10 October   217     642.5        390.5            203.5        199.5  
# … with 108 more rows
Code
#tidy data 2 Separate year and month, remove "/1"
Chicken_tidy <- Chicken1 %>%
  mutate(Date = str_remove(Date, "/1")) %>%
  separate(Date, into=c("Month", "Year"), sep=" ") %>%
  fill(Year)
Chicken_tidy
# A tibble: 118 × 7
   Month     Year  Whole   `B/S Breast` `Bone-in Breast` `Whole Legs` Thighs 
   <chr>     <chr> <chr>   <chr>        <chr>            <chr>        <chr>  
 1 Jan       2004  197.5   645.5        too few          193.5        too few
 2 February  2004  197.5   642.5        too few          193.5        203    
 3 March     2004  209     642.5        too few          193.5        203    
 4 April     2004  212     642.5        too few          193.5        203    
 5 May       2004  214.5   642.5        too few          193.5        203    
 6 June      2004  216.375 641          too few          201.875      200.375
 7 July      2004  217     642.5        390.5            203.5        199.5  
 8 August    2004  217     642.5        390.5            203.5        199.5  
 9 September 2004  217     642.5        390.5            203.5        199.5  
10 October   2004  217     642.5        390.5            203.5        199.5  
# … with 108 more rows
Code
# find and replace "too few"
Chicken_tidy[Chicken_tidy== "too few"] <- NA
Chicken_tidy %>%
  select(Year, Month, Whole, `B/S Breast`, `Bone-in Breast`, `Whole Legs`, Thighs)
# A tibble: 118 × 7
   Year  Month     Whole   `B/S Breast` `Bone-in Breast` `Whole Legs` Thighs 
   <chr> <chr>     <chr>   <chr>        <chr>            <chr>        <chr>  
 1 2004  Jan       197.5   645.5        <NA>             193.5        <NA>   
 2 2004  February  197.5   642.5        <NA>             193.5        203    
 3 2004  March     209     642.5        <NA>             193.5        203    
 4 2004  April     212     642.5        <NA>             193.5        203    
 5 2004  May       214.5   642.5        <NA>             193.5        203    
 6 2004  June      216.375 641          <NA>             201.875      200.375
 7 2004  July      217     642.5        390.5            203.5        199.5  
 8 2004  August    217     642.5        390.5            203.5        199.5  
 9 2004  September 217     642.5        390.5            203.5        199.5  
10 2004  October   217     642.5        390.5            203.5        199.5  
# … with 108 more rows

Finally, I got two different datasets that clearly reflect two distinct datas.

Source Code
---
title: "Challenge 3 Guanhua Tan"
author: "Guanhua Tan"
desription: "Tidy Data: Pivoting"
date: "09/24/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  - eggs
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(readxl) 

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to:

1.  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2.  identify what needs to be done to tidy the current data
3.  anticipate the shape of pivoted data
4.  pivot the data into tidy format using `pivot_longer`

## Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

-   animal_weights.csv ⭐
-   eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
-   australian_marriage\*.xls ⭐⭐⭐
-   USA Households\*.xlsx ⭐⭐⭐⭐
-   sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟

```{r}
organic_egg<-read_excel("_data/organiceggpoultry.xls")
view(organic_egg)
nrow(organic_egg)
ncol(organic_egg)
nrow(organic_egg)*ncol(organic_egg)
```

### Briefly describe the data

Describe the data, and be sure to comment on why you are planning to pivot it to make it "tidy"

This dataset has two different parts of information with the same timeline covering from 2004 to 2012. One part is about the prices of one dozen and 1/2 dozen of large eggs and extra large eggs. The other part is about the prices of "whole", "B/S Breast", "Bone-in Breast", "Whole Legs", "Thighs"--different parts of a Chicken. In order to clean, I would like to divide it into two datesets. One just includes the information about the eggs and the other the data about the Chicken. Additionally, I'd like to separate "Year" and "Month" to make very cell have its own value.

## Anticipate the End Result
### Challenge: Describe the final dimensions

I'll divide the existing dataset into two distinct dataset. Existing data has 124 rows, 11 columns and 1364 cases. Expected data 1 will have 118 rows, 6 columns and 708 cases. Expected data 2 will have 118 rows, 7 columns and 826 cases.

```{r}
#create a new dataset about the eggs
column_names <-c("Date", "Extra_Large 12","Extra_Large 6","Large 12", "Large 6")
eggs1  <- organic_egg[5:122, 1:5]
colnames(eggs1) =column_names
eggs1
```



```{r}
# separate year and month and remove "/1"
eggs2 <-eggs1  %>%
  mutate(Date = str_remove(Date, " /1")) %>%
  separate(Date, into=c("Month", "Year"), sep=" ") %>%
  fill(Year) %>%
  select(Year, Month, `Extra_Large 12`, `Extra_Large 6`, `Large 12`, `Large 6`)
eggs2
```

### Challenge: Pivot the Chosen Data

Through the function pivot_longer, I reorganized the dataset. I created three new columns--categorizes, quantiy, and prices. I separated original column names with the whitespace. I moved the type description under the column name "categories" and the numbers (12 or 6) under the column name "quantity." All values were moved to columns "prices." New column names reflect the characteristics of all values that makes the dataset clean and clear.


```{r}
# pivot_longer
eggs2 %>%
  pivot_longer(cols = contains("Large"), 
               names_to = c("categroies", "quanity"),
               names_sep=" ",
               values_to= "prices")
nrow(eggs2)
ncol(eggs2)
nrow(eggs2)*ncol(eggs2)
```


```{r}
# create a new dataset about the Chicken and tidy data
organic_egg<-read_excel("_data/organiceggpoultry.xls")
column_names <-c("Date", "Extra_Large 12","Extra_Large 6","Large 12", "Large 6", "NA", "Whole", "B/S Breast",
"Bone-in Breast", "Whole Legs", "Thighs")
Chicken<-organic_egg[5:122, 1:11]
colnames(Chicken) = column_names
Chicken1 <- Chicken %>%
  select(`Date`, `Whole`, `B/S Breast`, `Bone-in Breast`, `Whole Legs`, `Thighs`)
Chicken1
```

```{r}
#tidy data 2 Separate year and month, remove "/1"
Chicken_tidy <- Chicken1 %>%
  mutate(Date = str_remove(Date, "/1")) %>%
  separate(Date, into=c("Month", "Year"), sep=" ") %>%
  fill(Year)
Chicken_tidy
  

```
```{r}
# find and replace "too few"
Chicken_tidy[Chicken_tidy== "too few"] <- NA
Chicken_tidy %>%
  select(Year, Month, Whole, `B/S Breast`, `Bone-in Breast`, `Whole Legs`, Thighs)

```

Finally, I got two different datasets that clearly reflect two distinct datas.