month year large_half_dozen large_dozen extra_large_half_dozen
1 January 2004 126.0 230.000 132.0
2 February 2004 128.5 226.250 134.5
3 March 2004 131.0 225.000 137.0
4 April 2004 131.0 225.000 137.0
5 May 2004 131.0 225.000 137.0
6 June 2004 133.5 231.375 137.0
extra_large_dozen
1 230.0
2 230.0
3 230.0
4 234.5
5 236.0
6 241.0
From the original file, it looks like the file is about the number of purchase of different sizes eggs in each year and months.
We can look the trend of the number in years. So first we drop the month colunm and sum all the number by years. The tibble is wide and not clear to see the year trend, so we pivot it.
I expect that the colunm is each year and row is each dozen, which is like the dataframe after using t(). The pivot function seems like unvailable in this situation.
From this tibble, we can see the year trend of each dozen much clearer. We can also know people bought more extra large dozen and large dozen than other two scales in general.
We can also choose a specific month like January to see. The average value of extra_large_half_dozen in January from 2009 to 2012 didn’t change which means it was quite steady. All scales of dozen had a large increase between 2008 and 2009. It should happen something in January 2009. Even though the general trend is increasing, the large_dozen has a little decrease from 2009.
Example: find current and future data dimensions
One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.
Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting \(k-3\) variables into a longer format where the \(k-3\) variable names will move into the names_to variable and the current values in each of those columns will move into the values_to variable. Therefore, we would expect \(n * (k-3)\) rows in the pivoted dataframe!
# A tibble: 6 × 5
country year trade outgoing incoming
<chr> <dbl> <chr> <dbl> <dbl>
1 Mexico 1980 NAFTA 83.8 679.
2 USA 1990 NAFTA 2184. 446.
3 France 1980 EU 777. 1275.
4 Mexico 1990 NAFTA 1325. 1354.
5 USA 1980 NAFTA 1026. 1063.
6 France 1990 EU 1217. 1813.
Code
#existing rows/casesnrow(df)
[1] 6
Code
#existing columns/casesncol(df)
[1] 5
Code
#expected rows/casesnrow(df) * (ncol(df)-3)
[1] 12
Code
# expected columns 3+2
[1] 5
Or simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.
Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.
Example
Code
df<-pivot_longer(df, col =c(outgoing, incoming),names_to="trade_direction",values_to ="trade_value")df
# A tibble: 12 × 5
country year trade trade_direction trade_value
<chr> <dbl> <chr> <chr> <dbl>
1 Mexico 1980 NAFTA outgoing 83.8
2 Mexico 1980 NAFTA incoming 679.
3 USA 1990 NAFTA outgoing 2184.
4 USA 1990 NAFTA incoming 446.
5 France 1980 EU outgoing 777.
6 France 1980 EU incoming 1275.
7 Mexico 1990 NAFTA outgoing 1325.
8 Mexico 1990 NAFTA incoming 1354.
9 USA 1980 NAFTA outgoing 1026.
10 USA 1980 NAFTA incoming 1063.
11 France 1990 EU outgoing 1217.
12 France 1990 EU incoming 1813.
Yes, once it is pivoted long, our resulting data are \(12x5\) - exactly what we expected!
Source Code
---title: "Challenge 3"author: "Shuqi Hong"description: "Tidy Data: Pivoting"date: "6/07/2023"format: html: toc: true code-fold: true code-copy: true code-tools: truecategories: - challenge_3 - animal_weights - eggs - australian_marriage - usa_households - sce_labor---```{r}#| label: setup#| warning: false#| message: falselibrary(tidyverse)knitr::opts_chunk$set(echo =TRUE, warning=FALSE, message=FALSE)eggs <-read.csv("_data/eggs_tidy.csv")head(eggs)```From the original file, it looks like the file is about the number of purchase of different sizes eggs in each year and months.We can look the trend of the number in years. So first we drop the month colunm and sum all the number by years. The tibble is wide and not clear to see the year trend, so we pivot it.```{r}yeartrend <- eggs %>%select(-month) %>%group_by(year) %>%summarise_all(sum) %>%t()yeartrendeggs %>%select(-month) %>%group_by(year) %>%pivot_wider(names_from = year, values_from =c(large_half_dozen,large_dozen))```I expect that the colunm is each year and row is each dozen, which is like the dataframe after using t(). The pivot function seems like unvailable in this situation.```{r}eggs %>%group_by(year) %>%pivot_longer(`large_half_dozen`:`extra_large_dozen`, names_to ="dozen", values_to ="number" ) %>%group_by(dozen,year) %>%summarise( mean =mean(`number`)) %>%pivot_wider(names_from = year, values_from = mean)```From this tibble, we can see the year trend of each dozen much clearer. We can also know people bought more extra large dozen and large dozen than other two scales in general.```{r}mean_January <- eggs %>%filter(month =="January") %>%pivot_longer(`large_half_dozen`:`extra_large_dozen`, names_to ="January_dozen", values_to ="number" ) %>%group_by(January_dozen,year) %>%summarise( January_mean =mean(`number`)) %>%pivot_wider(names_from = year, values_from = January_mean)mean_January```We can also choose a specific month like January to see. The average value of extra_large_half_dozen in January from 2009 to 2012 didn't change which means it was quite steady. All scales of dozen had a large increase between 2008 and 2009. It should happen something in January 2009. Even though the general trend is increasing, the large_dozen has a little decrease from 2009.### Example: find current and future data dimensionsOne easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting $k-3$ variables into a longer format where the $k-3$ variable names will move into the `names_to` variable and the current values in each of those columns will move into the `values_to` variable. Therefore, we would expect $n * (k-3)$ rows in the pivoted dataframe!Lets see if this works with a simple example.```{r}#| tbl-cap: Exampledf<-tibble(country =rep(c("Mexico", "USA", "France"),2),year =rep(c(1980,1990), 3), trade =rep(c("NAFTA", "NAFTA", "EU"),2),outgoing =rnorm(6, mean=1000, sd=500),incoming =rlogis(6, location=1000, scale =400))df#existing rows/casesnrow(df)#existing columns/casesncol(df)#expected rows/casesnrow(df) * (ncol(df)-3)# expected columns 3+2```Or simple example has $n = 6$ rows and $k - 3 = 2$ variables being pivoted, so we expect a new dataframe to have $n * 2 = 12$ rows x $3 + 2 = 5$ columns.## Pivot the DataNow we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.### Example```{r}#| tbl-cap: Pivoted Exampledf<-pivot_longer(df, col =c(outgoing, incoming),names_to="trade_direction",values_to ="trade_value")df```Yes, once it is pivoted long, our resulting data are $12x5$ - exactly what we expected!