Tidy a dataset

challenge3

Neha Jhurani

eggs_tidy.csv

Author

Neha Jhurani

Published

April 12, 2023

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE)

Analysing eggs dataset

library(readr)

#reading eggs_tidy csv data
eggs_tidy_data <- read_csv("_data/eggs_tidy.csv")

Rows: 120 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): month
dbl (5): year, large_half_dozen, large_dozen, extra_large_half_dozen, extra_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#extracting all the column names
colnames(eggs_tidy_data)

[1] "month"                  "year"                   "large_half_dozen"      
[4] "large_dozen"            "extra_large_half_dozen" "extra_large_dozen"

# Summarise each column of the eggs tidy dataset: minimum, 1st quartile, median, mean, 3rd quartile and maximum for numeric columns (an NA count is printed only when a numeric column has missing values)
summary(eggs_tidy_data)

    month                year      large_half_dozen  large_dozen   
 Length:120         Min.   :2004   Min.   :126.0    Min.   :225.0  
 Class :character   1st Qu.:2006   1st Qu.:129.4    1st Qu.:233.5  
 Mode  :character   Median :2008   Median :174.5    Median :267.5  
                    Mean   :2008   Mean   :155.2    Mean   :254.2  
                    3rd Qu.:2011   3rd Qu.:174.5    3rd Qu.:268.0  
                    Max.   :2013   Max.   :178.0    Max.   :277.5  
 extra_large_half_dozen extra_large_dozen
 Min.   :132.0          Min.   :230.0    
 1st Qu.:135.8          1st Qu.:241.5    
 Median :185.5          Median :285.5    
 Mean   :164.2          Mean   :266.8    
 3rd Qu.:185.5          3rd Qu.:285.5    
 Max.   :188.1          Max.   :290.0

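# Note: summary() prints an NA count for a numeric column only when values are missing. None appear above, so the five numeric columns are complete. summary() does not report missing values for the character column month.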

dim(eggs_tidy_data)

[1] 120   6
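Because `summary()` says nothing about missing values in a character column, a direct count per column is a useful cross-check. The sketch below is illustrative only: `eggs_sample` is a hypothetical three-row subset (January, March and May 2004) typed in by hand, not the full file. With the real data, the same `colSums(is.na(...))` call would be applied to `eggs_tidy_data`.

```r
library(tibble)

# Hypothetical toy subset: three rows of the dataset entered by hand.
eggs_sample <- tibble(
  month = c("January", "March", "May"),
  year = c(2004, 2004, 2004),
  large_half_dozen = c(126, 131, 131),
  large_dozen = c(230, 225, 225),
  extra_large_half_dozen = c(132, 137, 137),
  extra_large_dozen = c(230, 230, 236)
)

# Count missing values in every column, including the character column `month`.
na_counts <- colSums(is.na(eggs_sample))
na_counts
```

A count of zero for every column would confirm that nothing is missing.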

# The dataset contains 120 rows and 6 columns

head(eggs_tidy_data)

# A tibble: 6 × 6
  month     year large_half_dozen large_dozen extra_large_half_dozen extra_lar…¹
  <chr>    <dbl>            <dbl>       <dbl>                  <dbl>       <dbl>
1 January   2004             126         230                    132         230 
2 February  2004             128.        226.                   134.        230 
3 March     2004             131         225                    137         230 
4 April     2004             131         225                    137         234.
5 May       2004             131         225                    137         236 
6 June      2004             134.        231.                   137         241 
# … with abbreviated variable name ¹extra_large_dozen
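The summary shows years running from 2004 to 2013, and the table has 120 rows, which is what one row per month over ten years would give. A minimal sketch of that arithmetic follows. `expected_grid` is a hypothetical helper name, and the commented-out comparison assumes the months in the file are spelled as in R's built-in `month.name`.

```r
# Expected grid: one row per month for each year from 2004 to 2013.
expected_grid <- expand.grid(
  month = month.name,
  year = 2004:2013,
  stringsAsFactors = FALSE
)

# 12 months * 10 years
nrow(expected_grid)

# With the full data loaded, this would list any month/year pair that is absent:
# dplyr::anti_join(expected_grid, eggs_tidy_data, by = c("month", "year"))
```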

# The dataset contains monthly data for ten years, from January 2004 to December 2013. It stores the average volume for 4 different types of egg cartons. We know the values are averages because cartons are not sold in fractions, yet the dataset contains a few decimal values. For example, February 2004 has 128.5 large half-dozen cartons.

# For data to be 'tidy', the following rules must hold:
# 1. Each variable must have its own column.
# 2. Each observation must have its own row.
# 3. Each value must have its own cell.

# Analysis - The four variables in the dataset are a) month, b) year, c) carton_type, and d) units. month and year already have their own columns, but the remaining four columns (large_half_dozen, large_dozen, extra_large_half_dozen and extra_large_dozen) are values of carton_type rather than variables, so each row holds four observations, which breaks rule 2. To fix this, these four will no longer have their own individual columns. Instead, we introduce a 'carton_type' column that holds the carton name and store its average value in a 'units' column.

# The eggs dataset is made 'tidy' by pivoting it to have more rows and fewer columns. The columns to pivot are large_half_dozen, large_dozen, extra_large_half_dozen and extra_large_dozen; each becomes a row for the corresponding month and year of observation. We use the tidyverse function pivot_longer() to accomplish this.

pivoted_eggs_tidy_data <- eggs_tidy_data %>%
  pivot_longer(
    cols = ends_with("dozen"),
    names_to = "carton_type",
    values_to = "units")

view(pivoted_eggs_tidy_data)

dim(pivoted_eggs_tidy_data)

[1] 480   4
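The jump from 120 to 480 rows follows from each original row contributing one row per carton column. The sketch below checks this on a hypothetical three-row subset (`eggs_sample`, entered by hand from the rows printed earlier) and pivots back to the wide layout to confirm that no values were lost. The names `long_sample` and `wide_again` are illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy subset: three rows of the dataset entered by hand.
eggs_sample <- tibble(
  month = c("January", "March", "May"),
  year = c(2004, 2004, 2004),
  large_half_dozen = c(126, 131, 131),
  large_dozen = c(230, 225, 225),
  extra_large_half_dozen = c(132, 137, 137),
  extra_large_dozen = c(230, 230, 236)
)

# Same pivot as in the analysis, applied to the toy subset.
long_sample <- eggs_sample %>%
  pivot_longer(
    cols = ends_with("dozen"),
    names_to = "carton_type",
    values_to = "units"
  )

# Each original row yields one row per carton column: 3 rows * 4 columns.
nrow(long_sample)

# Every carton type should appear once per month/year combination.
count(long_sample, carton_type)

# Pivoting back to wide should reproduce the original table.
wide_again <- long_sample %>%
  pivot_wider(names_from = carton_type, values_from = units)
all.equal(as.data.frame(wide_again), as.data.frame(eggs_sample))
```

Applied to the full data, the same reasoning gives 120 * 4 = 480 rows, matching the `dim()` output.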

# The pivoted dataset contains 480 rows (120 original rows × 4 carton types) and 4 columns