DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 3

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in data
    • Briefly describe the data
  • Anticipate the End Result
    • Example: find current and future data dimensions
    • Challenge: Describe the final dimensions
  • Pivot the Data
    • Challenge: Pivot the Chosen Data

Challenge 3

  • Show All Code
  • Hide All Code

  • View Source
challenge_3
animal_weights
eggs
australian_marriage
usa_households
sce_labor
Author

Siddharth Nammara Kalyana Raman

Published

December 2, 2022

Code
library(tidyverse)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • animal_weights.csv ⭐
  • eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
  • australian_marriage*.xls ⭐⭐⭐
  • USA Households*.xlsx ⭐⭐⭐⭐
  • sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟
Code
#read data and name it eggs_data
eggs_data<-read.csv('_data/eggs_tidy.csv')
eggs_data
        month year large_half_dozen large_dozen extra_large_half_dozen
1     January 2004          126.000     230.000                132.000
2    February 2004          128.500     226.250                134.500
3       March 2004          131.000     225.000                137.000
4       April 2004          131.000     225.000                137.000
5         May 2004          131.000     225.000                137.000
6        June 2004          133.500     231.375                137.000
7        July 2004          133.500     233.500                137.000
8      August 2004          133.500     233.500                137.000
9   September 2004          129.750     233.500                135.875
10    October 2004          128.500     233.500                135.500
11   November 2004          128.500     233.500                135.500
12   December 2004          128.500     233.500                135.500
13    January 2005          128.500     233.500                135.500
14   February 2005          128.500     233.500                135.500
15      March 2005          128.500     233.500                135.500
16      April 2005          128.500     233.500                135.500
17        May 2005          128.500     233.500                135.500
18       June 2005          128.500     233.500                135.500
19       July 2005          128.500     233.500                135.500
20     August 2005          128.500     233.500                135.500
21  September 2005          128.500     233.500                135.500
22    October 2005          128.500     233.500                135.500
23   November 2005          128.500     233.500                135.500
24   December 2005          128.500     233.500                135.500
25    January 2006          128.500     233.500                135.500
26   February 2006          128.500     233.500                135.500
27      March 2006          128.500     233.500                135.500
28      April 2006          128.500     233.500                135.500
29        May 2006          128.500     233.500                135.500
30       June 2006          128.500     233.500                135.500
31       July 2006          128.500     233.500                135.500
32     August 2006          128.500     233.500                135.500
33  September 2006          128.500     233.500                135.500
34    October 2006          128.500     233.500                135.500
35   November 2006          128.500     233.500                135.500
36   December 2006          128.500     233.500                135.500
37    January 2007          128.500     233.500                135.500
38   February 2007          131.125     236.125                138.125
39      March 2007          132.000     237.000                139.000
40      April 2007          132.000     237.000                139.000
41        May 2007          132.000     237.000                139.000
42       June 2007          132.000     237.000                139.000
43       July 2007          132.000     237.000                139.000
44     August 2007          132.000     237.000                139.000
45  September 2007          132.000     237.000                139.000
46    October 2007          132.000     237.000                139.000
47   November 2007          132.000     237.000                139.000
48   December 2007          132.000     237.000                139.000
49    January 2008          132.000     237.000                139.000
50   February 2008          132.000     237.000                139.000
51      March 2008          132.000     237.000                139.000
52      April 2008          132.000     237.000                139.000
53        May 2008          132.000     237.000                139.000
54       June 2008          174.500     277.500                185.500
55       July 2008          174.500     277.500                185.500
56     August 2008          174.500     277.500                185.500
57  September 2008          174.500     277.500                185.500
58    October 2008          174.500     277.500                185.500
59   November 2008          174.500     277.500                185.500
60   December 2008          174.500     277.500                185.500
61    January 2009          174.500     277.500                185.500
62   February 2009          174.500     277.500                185.500
63      March 2009          174.500     277.500                185.500
64      April 2009          174.500     277.500                185.500
65        May 2009          174.500     277.500                185.500
66       June 2009          174.500     277.500                185.500
67       July 2009          174.500     277.500                185.500
68     August 2009          174.500     271.500                185.500
69  September 2009          174.500     271.500                185.500
70    October 2009          174.500     271.500                185.500
71   November 2009          174.500     271.500                185.500
72   December 2009          174.500     271.500                185.500
73    January 2010          174.500     271.500                185.500
74   February 2010          174.500     271.500                185.500
75      March 2010          174.500     268.000                185.500
76      April 2010          174.500     268.000                185.500
77        May 2010          174.500     268.000                185.500
78       June 2010          174.500     268.000                185.500
79       July 2010          174.500     268.000                185.500
80     August 2010          174.500     268.000                185.500
81  September 2010          174.500     268.000                185.500
82    October 2010          174.500     267.500                185.500
83   November 2010          174.500     267.500                185.500
84   December 2010          174.500     267.500                185.500
85    January 2011          174.500     267.500                185.500
86   February 2011          174.500     267.500                185.500
87      March 2011          174.500     267.500                185.500
88      April 2011          174.500     267.500                185.500
89        May 2011          174.500     267.500                185.500
90       June 2011          174.500     270.000                185.500
91       July 2011          174.500     270.000                185.500
92     August 2011          174.500     270.000                185.500
93  September 2011          174.500     270.000                185.500
94    October 2011          174.500     270.000                185.500
95   November 2011          174.500     270.000                185.500
96   December 2011          174.500     270.000                185.500
97    January 2012          174.500     267.500                185.500
98   February 2012          174.500     267.500                185.500
99      March 2012          174.500     267.500                185.500
100     April 2012          174.500     267.500                185.500
101       May 2012          173.250     267.500                185.500
102      June 2012          173.250     267.500                185.500
103      July 2012          173.250     267.500                185.500
104    August 2012          173.250     267.500                185.500
105 September 2012          173.250     267.500                185.500
106   October 2012          173.250     267.500                185.500
107  November 2012          178.000     267.500                188.130
108  December 2012          178.000     267.500                188.130
109   January 2013          178.000     267.500                188.130
110  February 2013          178.000     267.500                188.130
111     March 2013          178.000     267.500                188.130
112     April 2013          178.000     267.500                188.130
113       May 2013          178.000     267.500                188.130
114      June 2013          178.000     267.500                188.130
115      July 2013          178.000     267.500                188.130
116    August 2013          178.000     267.500                188.130
117 September 2013          178.000     267.500                188.130
118   October 2013          178.000     267.500                188.130
119  November 2013          178.000     267.500                188.130
120  December 2013          178.000     267.500                188.130
    extra_large_dozen
1             230.000
2             230.000
3             230.000
4             234.500
5             236.000
6             241.000
7             241.000
8             241.000
9             241.000
10            241.000
11            241.000
12            241.000
13            241.000
14            241.000
15            241.000
16            241.000
17            241.000
18            241.000
19            241.000
20            241.000
21            241.000
22            241.000
23            241.000
24            241.000
25            241.000
26            241.000
27            241.375
28            241.500
29            241.500
30            241.500
31            241.500
32            241.500
33            241.500
34            241.500
35            241.500
36            241.500
37            241.500
38            244.125
39            245.000
40            245.000
41            245.000
42            245.000
43            245.000
44            245.000
45            245.000
46            245.000
47            245.000
48            245.000
49            245.000
50            245.000
51            245.000
52            245.000
53            245.000
54            285.500
55            285.500
56            285.500
57            285.500
58            285.500
59            285.500
60            285.500
61            285.500
62            285.500
63            285.500
64            285.500
65            285.500
66            285.500
67            285.500
68            285.500
69            285.500
70            285.500
71            285.500
72            285.500
73            285.500
74            285.500
75            285.500
76            285.500
77            285.500
78            285.500
79            285.500
80            285.500
81            285.500
82            285.500
83            285.500
84            285.500
85            285.500
86            285.500
87            285.500
88            285.500
89            285.500
90            285.500
91            285.500
92            285.500
93            285.500
94            285.500
95            285.500
96            285.500
97            285.500
98            288.500
99            288.500
100           288.500
101           288.500
102           288.500
103           288.500
104           288.500
105           288.500
106           288.500
107           290.000
108           290.000
109           290.000
110           290.000
111           290.000
112           290.000
113           290.000
114           290.000
115           290.000
116           290.000
117           290.000
118           290.000
119           290.000
120           290.000

Briefly describe the data

Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”

Code
print(dfSummary(eggs_data))
Data Frame Summary  
eggs_data  
Dimensions: 120 x 6  
Duplicates: 0  

-------------------------------------------------------------------------------------------------------------------------
No   Variable                 Stats / Values             Freqs (% of Valid)    Graph                 Valid      Missing  
---- ------------------------ -------------------------- --------------------- --------------------- ---------- ---------
1    month                    1. April                   10 ( 8.3%)            I                     120        0        
     [character]              2. August                  10 ( 8.3%)            I                     (100.0%)   (0.0%)   
                              3. December                10 ( 8.3%)            I                                         
                              4. February                10 ( 8.3%)            I                                         
                              5. January                 10 ( 8.3%)            I                                         
                              6. July                    10 ( 8.3%)            I                                         
                              7. June                    10 ( 8.3%)            I                                         
                              8. March                   10 ( 8.3%)            I                                         
                              9. May                     10 ( 8.3%)            I                                         
                              10. November               10 ( 8.3%)            I                                         
                              [ 2 others ]               20 (16.7%)            III                                       

2    year                     Mean (sd) : 2008.5 (2.9)   2004 : 12 (10.0%)     II                    120        0        
     [integer]                min < med < max:           2005 : 12 (10.0%)     II                    (100.0%)   (0.0%)   
                              2004 < 2008.5 < 2013       2006 : 12 (10.0%)     II                                        
                              IQR (CV) : 5 (0)           2007 : 12 (10.0%)     II                                        
                                                         2008 : 12 (10.0%)     II                                        
                                                         2009 : 12 (10.0%)     II                                        
                                                         2010 : 12 (10.0%)     II                                        
                                                         2011 : 12 (10.0%)     II                                        
                                                         2012 : 12 (10.0%)     II                                        
                                                         2013 : 12 (10.0%)     II                                        

3    large_half_dozen         Mean (sd) : 155.2 (22.6)   126.00 :  1 ( 0.8%)                         120        0        
     [numeric]                min < med < max:           128.50 : 29 (24.2%)   IIII                  (100.0%)   (0.0%)   
                              126 < 174.5 < 178          129.75 :  1 ( 0.8%)                                             
                              IQR (CV) : 45.1 (0.1)      131.00 :  3 ( 2.5%)                                             
                                                         131.12!:  1 ( 0.8%)                                             
                                                         132.00 : 15 (12.5%)   II                                        
                                                         133.50 :  3 ( 2.5%)                                             
                                                         173.25 :  6 ( 5.0%)   I                                         
                                                         174.50 : 47 (39.2%)   IIIIIII                                   
                                                         178.00 : 14 (11.7%)   II                                        
                                                         ! rounded                                                       

4    large_dozen              Mean (sd) : 254.2 (18.5)   12 distinct values                    :     120        0        
     [numeric]                min < med < max:                                                 :     (100.0%)   (0.0%)   
                              225 < 267.5 < 277.5                                :             :                         
                              IQR (CV) : 34.5 (0.1)                              : .           : .                       
                                                                               . : :           : :                       

5    extra_large_half_dozen   Mean (sd) : 164.2 (24.7)   132.00 :  1 ( 0.8%)                         120        0        
     [numeric]                min < med < max:           134.50 :  1 ( 0.8%)                         (100.0%)   (0.0%)   
                              132 < 185.5 < 188.1        135.50 : 28 (23.3%)   IIII                                      
                              IQR (CV) : 49.7 (0.2)      135.88!:  1 ( 0.8%)                                             
                                                         137.00 :  6 ( 5.0%)   I                                         
                                                         138.12!:  1 ( 0.8%)                                             
                                                         139.00 : 15 (12.5%)   II                                        
                                                         185.50 : 53 (44.2%)   IIIIIIII                                  
                                                         188.13 : 14 (11.7%)   II                                        
                                                         ! rounded                                                       

6    extra_large_dozen        Mean (sd) : 266.8 (22.8)   11 distinct values              :           120        0        
     [numeric]                min < med < max:                                   .       :           (100.0%)   (0.0%)   
                              230 < 285.5 < 290                                  :       :                               
                              IQR (CV) : 44 (0.1)                                :       :                               
                                                                               . :       :                               
-------------------------------------------------------------------------------------------------------------------------
Code
# to know the statistics of the eggs dataset
summary(eggs_data)
    month                year      large_half_dozen  large_dozen   
 Length:120         Min.   :2004   Min.   :126.0    Min.   :225.0  
 Class :character   1st Qu.:2006   1st Qu.:129.4    1st Qu.:233.5  
 Mode  :character   Median :2008   Median :174.5    Median :267.5  
                    Mean   :2008   Mean   :155.2    Mean   :254.2  
                    3rd Qu.:2011   3rd Qu.:174.5    3rd Qu.:268.0  
                    Max.   :2013   Max.   :178.0    Max.   :277.5  
 extra_large_half_dozen extra_large_dozen
 Min.   :132.0          Min.   :230.0    
 1st Qu.:135.8          1st Qu.:241.5    
 Median :185.5          Median :285.5    
 Mean   :164.2          Mean   :266.8    
 3rd Qu.:185.5          3rd Qu.:285.5    
 Max.   :188.1          Max.   :290.0    
Code
# review the first few rows of the dataset
head(eggs_data)
     month year large_half_dozen large_dozen extra_large_half_dozen
1  January 2004            126.0     230.000                  132.0
2 February 2004            128.5     226.250                  134.5
3    March 2004            131.0     225.000                  137.0
4    April 2004            131.0     225.000                  137.0
5      May 2004            131.0     225.000                  137.0
6     June 2004            133.5     231.375                  137.0
  extra_large_dozen
1             230.0
2             230.0
3             230.0
4             234.5
5             236.0
6             241.0

We can see from the rendered data frame summary that there are 6 columns and 120 rows of data, covering the years 2004 through 2013 as well as all 12 months of the year. The remaining 4 columns, with values ranging from 126-290, are large half dozen, large dozen, extra large half dozen, and extra large dozen.

Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting \(k-3\) variables into a longer format where the \(k-3\) variable names will move into the names_to variable and the current values in each of those columns will move into the values_to variable. Therefore, we would expect \(n * (k-3)\) rows in the pivoted dataframe!

Example: find current and future data dimensions

Code
#Finding the dimensions before pivoting
#number of rows
nrow(eggs_data)
[1] 120
Code
#number of columns
ncol(eggs_data)
[1] 6
Code
#expected dimensions after pivot

#expected number of rows
nrow(eggs_data)*(ncol(eggs_data)-2)
[1] 480
Code
#expected number of cols
3+2
[1] 5

Or simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.

Challenge: Describe the final dimensions

Document your work here.

Due to the fact that each size-quantity pairing will have its own row, I anticipate the eventual data set to be four times longer (rather than the initial 4 columns after month and year). Since I will delete the names of the four size-quantity columns and replace them with one column each for size, quantity, and average price, I anticipate that the overall number of columns will decrease by one. For this newly reorganized data, see below.

Any additional comments?

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Code
eggs_longer<- eggs_data%>%
  pivot_longer(cols=contains("large"),
               names_to = c("size", "quantity"),
               names_sep="_",
               values_to = "amount"
  )
eggs_longer
# A tibble: 480 × 5
   month     year size  quantity amount
   <chr>    <int> <chr> <chr>     <dbl>
 1 January   2004 large half       126 
 2 January   2004 large dozen      230 
 3 January   2004 extra large      132 
 4 January   2004 extra large      230 
 5 February  2004 large half       128.
 6 February  2004 large dozen      226.
 7 February  2004 extra large      134.
 8 February  2004 extra large      230 
 9 March     2004 large half       131 
10 March     2004 large dozen      225 
# … with 470 more rows
Code
#number of rows
nrow(eggs_longer)
[1] 480
Code
#number of cols
ncol(eggs_longer)
[1] 5

Challenge: Pivot the Chosen Data

Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?

As expected, I now have one less column and four times as many rows (120 became 480). (6 became 5). I now have one observation for each row and am prepared to start my analysis. Depending on the nature of my study or if there are more tables to put in at some time, I might want to change the cents into dollars. The data keeps the precise decimals even when the preview tibble below rounds to whole cents.

Code
eggs_price <- mutate(eggs_longer, 
       avg_USD = amount / 100
       )%>%
  select(!contains ("amount"))
eggs_price
# A tibble: 480 × 5
   month     year size  quantity avg_USD
   <chr>    <int> <chr> <chr>      <dbl>
 1 January   2004 large half        1.26
 2 January   2004 large dozen       2.3 
 3 January   2004 extra large       1.32
 4 January   2004 extra large       2.3 
 5 February  2004 large half        1.28
 6 February  2004 large dozen       2.26
 7 February  2004 extra large       1.34
 8 February  2004 extra large       2.3 
 9 March     2004 large half        1.31
10 March     2004 large dozen       2.25
# … with 470 more rows

Any additional comments?

Source Code
---
title: "Challenge 3"
author: "Siddharth Nammara Kalyana Raman"
desription: "Tidy Data: Pivoting"
date: "12/02/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  - animal_weights
  - eggs
  - australian_marriage
  - usa_households
  - sce_labor
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to:

1.  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2.  identify what needs to be done to tidy the current data
3.  anticipate the shape of pivoted data
4.  pivot the data into tidy format using `pivot_longer`

## Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

-   animal_weights.csv ⭐
-   eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
-   australian_marriage\*.xls ⭐⭐⭐
-   USA Households\*.xlsx ⭐⭐⭐⭐
-   sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟

```{r}
#read data and name it eggs_data
eggs_data<-read.csv('_data/eggs_tidy.csv')
eggs_data

```

### Briefly describe the data

Describe the data, and be sure to comment on why you are planning to pivot it to make it "tidy"

```{r}
print(dfSummary(eggs_data))


```

```{r}
# to know the statistics of the eggs dataset
summary(eggs_data)

```

```{r}
# review the first few rows of the dataset
head(eggs_data)

```
We can see from the rendered data frame summary that there are 6 columns and 120 rows of data, covering the years 2004 through 2013 as well as all 12 months of the year. The remaining 4 columns, with values ranging from 126-290, are large half dozen, large dozen, extra large half dozen, and extra large dozen. 

## Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product *should* look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting $k-3$ variables into a longer format where the $k-3$ variable names will move into the `names_to` variable and the current values in each of those columns will move into the `values_to` variable. Therefore, we would expect $n * (k-3)$ rows in the pivoted dataframe!


### Example: find current and future data dimensions

```{r}
#Finding the dimensions before pivoting
#number of rows
nrow(eggs_data)

#number of columns
ncol(eggs_data)
```
```{r}
#expected dimensions after pivot

#expected number of rows
nrow(eggs_data)*(ncol(eggs_data)-2)

#expected number of cols
3+2
```


Or simple example has $n = 6$ rows and $k - 3 = 2$ variables being pivoted, so we expect a new dataframe to have $n * 2 = 12$ rows x $3 + 2 = 5$ columns.

### Challenge: Describe the final dimensions

Document your work here.

Due to the fact that each size-quantity pairing will have its own row, I anticipate the eventual data set to be four times longer (rather than the initial 4 columns after month and year). Since I will delete the names of the four size-quantity columns and replace them with one column each for size, quantity, and average price, I anticipate that the overall number of columns will decrease by one. For this newly reorganized data, see below.


Any additional comments?

## Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.


```{r}
#| tbl-cap: Pivoted Example

eggs_longer<- eggs_data%>%
  pivot_longer(cols=contains("large"),
               names_to = c("size", "quantity"),
               names_sep="_",
               values_to = "amount"
  )
eggs_longer
#number of rows
nrow(eggs_longer)
#number of cols
ncol(eggs_longer)
```


### Challenge: Pivot the Chosen Data

Document your work here. What will a new "case" be once you have pivoted the data? How does it meet requirements for tidy data?

As expected, I now have one less column and four times as many rows (120 became 480). (6 became 5). I now have one observation for each row and am prepared to start my analysis. Depending on the nature of my study or if there are more tables to put in at some time, I might want to change the cents into dollars. The data keeps the precise decimals even when the preview tibble below rounds to whole cents.
```{r}
eggs_price <- mutate(eggs_longer, 
       avg_USD = amount / 100
       )%>%
  select(!contains ("amount"))
eggs_price

```

Any additional comments?