---
title: "Challenge 3"
author: "Siddharth Nammara Kalyana Raman"
description: "Tidy Data: Pivoting"
date: "12/02/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- animal_weights
- eggs
- australian_marriage
- usa_households
- sce_labor
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(summarytools)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2. identify what needs to be done to tidy the current data
3. anticipate the shape of pivoted data
4. pivot the data into tidy format using `pivot_longer`
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- animal_weights.csv ⭐
- eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
- australian_marriage\*.xls ⭐⭐⭐
- USA Households\*.xlsx ⭐⭐⭐⭐
- sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟
```{r}
#read data and name it eggs_data
eggs_data <- read.csv("_data/eggs_tidy.csv")
eggs_data
```
### Briefly describe the data
Describe the data, and be sure to comment on why you are planning to pivot it to make it "tidy"
```{r}
print(dfSummary(eggs_data))
```
```{r}
# to know the statistics of the eggs dataset
summary(eggs_data)
```
```{r}
# review the first few rows of the dataset
head(eggs_data)
```
The rendered data frame summary shows 120 rows and 6 columns, covering the years 2004 through 2013 and all 12 months of each year. After `month` and `year`, the remaining 4 columns (`large_half_dozen`, `large_dozen`, `extra_large_half_dozen`, and `extra_large_dozen`) hold average prices in cents, ranging from 126 to 290.
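One quick way to verify that coverage is to count rows per year; an empty result below means every year contributes exactly 12 monthly observations (this check is an addition, using the column names shown above):
```{r}
# sanity check: flag any year that does not have exactly 12 rows
eggs_data %>%
  count(year) %>%
  filter(n != 12)
```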
## Anticipate the End Result
The first step in pivoting the data is to try to come up with a concrete vision of what the end product *should* look like - that way you will know whether or not your pivoting was successful.
One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.
Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 2 of the variables (`month` and `year`) are used to identify a case, so you will be pivoting $k-2$ variables into a longer format where the $k-2$ variable names will move into the `names_to` variable and the current values in each of those columns will move into the `values_to` variable. Therefore, we would expect $n * (k-2)$ rows in the pivoted dataframe!
### Example: find current and future data dimensions
```{r}
#Finding the dimensions before pivoting
#number of rows
nrow(eggs_data)
#number of columns
ncol(eggs_data)
```
```{r}
#expected dimensions after pivot
#expected number of rows: n * (k - 2)
nrow(eggs_data) * (ncol(eggs_data) - 2)
#expected number of cols: 2 id columns + 2 name columns (size, quantity) + 1 value column
2 + 2 + 1
```
Our example has $n = 120$ rows and $k - 2 = 4$ variables being pivoted, so we expect the new dataframe to have $n * 4 = 480$ rows x $2 + 2 + 1 = 5$ columns.
### Challenge: Describe the final dimensions
Document your work here.
Because each size-quantity pairing will get its own row, I anticipate the pivoted data set being four times longer than the original (one row for each of the 4 price columns that follow `month` and `year`). Since the four size-quantity column names will be replaced by one column each for size, quantity, and average price, the total number of columns should decrease by one, from 6 to 5. The newly reorganized data is shown below.
Any additional comments?
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
```{r}
#| tbl-cap: Pivoted Example
# pivot the four price columns into size, quantity, and amount;
# names_pattern keeps "extra_large" and "half_dozen" intact, which a
# plain names_sep = "_" split would cut at every underscore
eggs_longer <- eggs_data %>%
  pivot_longer(
    cols = contains("large"),
    names_to = c("size", "quantity"),
    names_pattern = "(.*large)_(half_dozen|dozen)",
    values_to = "amount"
  )
eggs_longer
#number of rows
nrow(eggs_longer)
#number of cols
ncol(eggs_longer)
```
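Rather than only eyeballing the printed dimensions, the sanity check can be made explicit. This is a small optional sketch using base R's `stopifnot()`, which errors if either comparison fails:
```{r}
# assert that the pivoted dimensions match the prediction above
stopifnot(
  nrow(eggs_longer) == nrow(eggs_data) * (ncol(eggs_data) - 2),
  ncol(eggs_longer) == 2 + 2 + 1
)
```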
### Challenge: Pivot the Chosen Data
Document your work here. What will a new "case" be once you have pivoted the data? How does it meet requirements for tidy data?
As expected, I now have four times as many rows (120 became 480) and one fewer column (6 became 5). A case is now the average price for one size-quantity combination in a given month and year, so each row holds a single observation and the data meets the tidy requirement of one observation per row. Depending on the nature of my study, or if more tables are added later, I might want to convert the prices from cents to dollars. The underlying data keeps the precise decimals even though the preview tibble below rounds the display to whole cents.
```{r}
# convert the average price from cents to dollars
eggs_price <- eggs_longer %>%
  mutate(avg_USD = amount / 100) %>%
  select(-amount)
eggs_price
```
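One payoff of the tidy layout is that grouped summaries become one-liners. As a brief illustration (not a required part of the challenge), here is the mean price for each size-quantity combination over the full 2004-2013 period:
```{r}
# average price in dollars for each egg type across all months
eggs_price %>%
  group_by(size, quantity) %>%
  summarise(mean_USD = mean(avg_USD), .groups = "drop")
```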
Any additional comments?