challenge_3
Tidy Data: Pivoting
Author

Priya Marla

Published

January 12, 2023

Code
library(tidyverse)
library(readxl)
library(stringr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Read in data

Read in one (or more) of the following datasets, using the correct R package and command. col_names = c(“District”,“Yes”, “Yes%”, “No”, “No%”,)

  • animal_weights.csv ⭐
  • eggs_tidy.csv ⭐⭐ or organicpoultry.xls ⭐⭐⭐
  • australian_marriage*.xlsx ⭐⭐⭐
  • USA Households*.xlsx ⭐⭐⭐⭐
  • sce_labor_chart_data_public.csv 🌟🌟🌟🌟🌟
Code
#loading the data from posts/_data folder
vote_states <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",sheet = "Table 2", skip = 7,
col_names = c("Location","Yes","rem","No","rem","rem","rem","rem","rem","rem","Response_not_clear","rem","No_response","rem","rem","rem"))
#dataset
vote_states
# A tibble: 184 × 16
   Location    Yes rem...3    No rem...5 rem...6 rem...7 rem...8 rem...9 rem..…¹
   <chr>     <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl> <lgl>     <dbl>   <dbl>
 1 New Sout…    NA    NA      NA    NA        NA      NA NA           NA    NA  
 2 Banks     37736    44.9 46343    55.1   84079     100 NA        84079    79.9
 3 Barton    37153    43.6 47984    56.4   85137     100 NA        85137    77.8
 4 Bennelong 42943    49.8 43215    50.2   86158     100 NA        86158    81  
 5 Berowra   48471    54.6 40369    45.4   88840     100 NA        88840    84.5
 6 Blaxland  20406    26.1 57926    73.9   78332     100 NA        78332    75  
 7 Bradfield 53681    60.6 34927    39.4   88608     100 NA        88608    83.5
 8 Calare    54091    60.2 35779    39.8   89870     100 NA        89870    77.8
 9 Chifley   32871    41.3 46702    58.7   79573     100 NA        79573    73.7
10 Cook      47505    55   38804    45     86309     100 NA        86309    82  
# … with 174 more rows, 6 more variables: Response_not_clear <dbl>,
#   rem...12 <dbl>, No_response <dbl>, rem...14 <dbl>, rem...15 <dbl>,
#   rem...16 <dbl>, and abbreviated variable name ¹​rem...10
Code
#Cleaning the dataset
votes <- vote_states %>%
  select(!contains("rem")) %>% #removing percentage columns
  mutate(State = case_when(str_ends(Location,"Divisions") ~ Location) ,.after = "Location") %>% #adding a new column state
  fill(State, .direction = "down") %>% #filling the values of the column state
  filter(!str_detect(Location, 'Total')) %>% #Removing rows having total counts
  filter(!str_detect(Location, 'Divisions')) %>% #Removing the divisions heading rows
  drop_na() #Removing rows with null values
votes
# A tibble: 150 × 6
   Location  State                       Yes    No Response_not_clear No_respo…¹
   <chr>     <chr>                     <dbl> <dbl>              <dbl>      <dbl>
 1 Banks     New South Wales Divisions 37736 46343                247      20928
 2 Barton    New South Wales Divisions 37153 47984                226      24008
 3 Bennelong New South Wales Divisions 42943 43215                244      19973
 4 Berowra   New South Wales Divisions 48471 40369                212      16038
 5 Blaxland  New South Wales Divisions 20406 57926                220      25883
 6 Bradfield New South Wales Divisions 53681 34927                202      17261
 7 Calare    New South Wales Divisions 54091 35779                285      25342
 8 Chifley   New South Wales Divisions 32871 46702                263      28180
 9 Cook      New South Wales Divisions 47505 38804                229      18713
10 Cowper    New South Wales Divisions 57493 38317                315      25197
# … with 140 more rows, and abbreviated variable name ¹​No_response

Briefly describe the data

Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”

###Description

This is an Australian marriage law survey dataset. A survey has been conducted in Australia to identify whether people are ok with same gender marriages are not. People have to respond with either “yes” or “no”. Eligible participants response is sometimes clear, sometimes not clear and sometimes the candidates do not vote. When the response is clear- it can be either “yes” or “no”. In this dataset, for each location the statistics of “yes”, “no”, “Not clear” and “No Response” have been recorded.

It has been identified that few columns and rows are unnecessary for the dataset. Hence while cleaning the dataset, starting few rows i.e till 7 have been removed and ending few rows with NULL values have been removed. For each location corresponding division has been added in the “states” column. Percentage columns have been removed to get a tidy view of the statistics. Count of all possible responses for each location are now recorded in columns “Yes”, “No”, “Response_not_clear”, “No_response”.

To make the dataset more tidy, I’ll be using pivot_longer to get each repsonse type for specific location in separate row.

Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

Suppose you have a dataset with \(n\) rows and \(k\) variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting \(k-3\) variables into a longer format where the \(k-3\) variable names will move into the names_to variable and the current values in each of those columns will move into the values_to variable. Therefore, we would expect \(n * (k-3)\) rows in the pivoted dataframe!

Example: find current and future data dimensions

Lets see if this works with a simple example.

Code
df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
           year = rep(c(1980,1990), 3), 
           trade = rep(c("NAFTA", "NAFTA", "EU"),2),
           outgoing = rnorm(6, mean=1000, sd=500),
           incoming = rlogis(6, location=1000, 
                             scale = 400))
df
# A tibble: 6 × 5
  country  year trade outgoing incoming
  <chr>   <dbl> <chr>    <dbl>    <dbl>
1 Mexico   1980 NAFTA     589.    -796.
2 USA      1990 NAFTA     999.     643.
3 France   1980 EU        718.    2316.
4 Mexico   1990 NAFTA    1708.     820.
5 USA      1980 NAFTA     790.    1421.
6 France   1990 EU       1014.    1128.
Code
#existing rows/cases
nrow(df)
[1] 6
Code
#existing columns/cases
ncol(df)
[1] 5
Code
#expected rows/cases
nrow(df) * (ncol(df)-3)
[1] 12
Code
# expected columns 
3 + 2
[1] 5

Or simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.

Challenge: Describe the final dimensions

Document your work here.

Code
#executed dim to know the dimentions i.e number of rows and columns of the dataset
dim(votes)
[1] 150   6
Code
#To know the column names in the dataset
colnames(votes)
[1] "Location"           "State"              "Yes"               
[4] "No"                 "Response_not_clear" "No_response"       
Code
#existing rows/cases
nrow(votes)
[1] 150
Code
#existing columns/cases
ncol(votes)
[1] 6
Code
#expected rows/cases
nrow(votes) * (ncol(votes)-2)
[1] 600
Code
# expected columns 
2 + 2
[1] 4

Currently there are 150 rows and 6 columns. After using pivot_longer number of rows will increase and number of columns will reduce Expected number of rows = 600 Expected number of columns = 4

Any additional comments?

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Example

Code
df<-pivot_longer(df, col = c(outgoing, incoming),
                 names_to="trade_direction",
                 values_to = "trade_value")
df
# A tibble: 12 × 5
   country  year trade trade_direction trade_value
   <chr>   <dbl> <chr> <chr>                 <dbl>
 1 Mexico   1980 NAFTA outgoing               589.
 2 Mexico   1980 NAFTA incoming              -796.
 3 USA      1990 NAFTA outgoing               999.
 4 USA      1990 NAFTA incoming               643.
 5 France   1980 EU    outgoing               718.
 6 France   1980 EU    incoming              2316.
 7 Mexico   1990 NAFTA outgoing              1708.
 8 Mexico   1990 NAFTA incoming               820.
 9 USA      1980 NAFTA outgoing               790.
10 USA      1980 NAFTA incoming              1421.
11 France   1990 EU    outgoing              1014.
12 France   1990 EU    incoming              1128.

Yes, once it is pivoted long, our resulting data are \(12x5\) - exactly what we expected!

Challenge: Pivot the Chosen Data

Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?

Code
votes_tidy <- pivot_longer(votes, col = c("Yes","No","Response_not_clear","No_response"),
                           names_to = "Response_type",
                           values_to = "Stats")
votes_tidy
# A tibble: 600 × 4
   Location  State                     Response_type      Stats
   <chr>     <chr>                     <chr>              <dbl>
 1 Banks     New South Wales Divisions Yes                37736
 2 Banks     New South Wales Divisions No                 46343
 3 Banks     New South Wales Divisions Response_not_clear   247
 4 Banks     New South Wales Divisions No_response        20928
 5 Barton    New South Wales Divisions Yes                37153
 6 Barton    New South Wales Divisions No                 47984
 7 Barton    New South Wales Divisions Response_not_clear   226
 8 Barton    New South Wales Divisions No_response        24008
 9 Bennelong New South Wales Divisions Yes                42943
10 Bennelong New South Wales Divisions No                 43215
# … with 590 more rows

To get a comparison of number of people who votes on each category or to calculates the statistics like mean etc. or for the visual representation the dataset has to present in a pivot longer way. Now it’s easier to understand the dataset.

Any additional comments?