DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 3

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Briefly describe the data
  • Anticipate the End Result
    • Example: find current and future data dimensions
    • Challenge: Describe the final dimensions
  • Pivot the Data
    • Example
    • Challenge: Pivot the Chosen Data

Challenge 3

  • Show All Code
  • Hide All Code

  • View Source
challenge_3
animal_weights
eggs
australian_marriage
usa_households
sce_labor
Author

Mariia Dubyk

Published

November 1, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Code
library(readxl)
marriage <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",
                       sheet = "Table 2",
                       col_names = c("Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Delete", "Delete", "Delete", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage"))

marriage <- slice(marriage, 8:180)
marriage <- select(marriage, "Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage")
marriage
# A tibble: 173 × 13
   Division  Yes   Yes_p…¹ No    No_pe…² Total…³ Total…⁴ Respo…⁵ Respo…⁶ Non_r…⁷
   <chr>     <chr> <chr>   <chr> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
 1 New Sout… <NA>  <NA>    <NA>  <NA>    <NA>    <NA>    <NA>    <NA>    <NA>   
 2 Banks     37736 44.899… 46343 55.100… 84079   100     247     0.2000… 20928  
 3 Barton    37153 43.600… 47984 56.399… 85137   100     226     0.2000… 24008  
 4 Bennelong 42943 49.799… 43215 50.200… 86158   100     244     0.2000… 19973  
 5 Berowra   48471 54.600… 40369 45.399… 88840   100     212     0.2000… 16038  
 6 Blaxland  20406 26.100… 57926 73.900… 78332   100     220     0.2000… 25883  
 7 Bradfield 53681 60.600… 34927 39.399… 88608   100     202     0.2000… 17261  
 8 Calare    54091 60.200… 35779 39.799… 89870   100     285     0.2000… 25342  
 9 Chifley   32871 41.299… 46702 58.700… 79573   100     263     0.2000… 28180  
10 Cook      47505 55      38804 45      86309   100     229     0.2000… 18713  
# … with 163 more rows, 3 more variables: Non_responding_percentage <chr>,
#   Total <chr>, Total_percentage <chr>, and abbreviated variable names
#   ¹​Yes_percentage, ²​No_percentage, ³​Total_clear, ⁴​Total_clear_percentage,
#   ⁵​Response_not_clear, ⁶​Response_not_clear_percentage, ⁷​Non_responding

Briefly describe the data

The dataset shows responses to the question in Australian Marriage Law Postal Survey 2017. We can see data gathered in different divisions of States and Territories in Australia. The responses are “Yes”, “No”, “Not clear” and “No response”. The table also shows percentage of each answer and total numbers.

To tidy data I will leave only divisions in the first column. I am going to move State and Territories to another column by adding a variable (as it was shown during the class).I will also remove percentages and total numbers from the table.

Code
marriage <- marriage%>%
  mutate(StateTerritory = case_when(str_ends(Division, "Divisions") ~ Division), NA_character_)%>%
  fill(StateTerritory, .direction = "down")
marriage <- filter(marriage, !str_detect(Division, "Division|Australia"))
marriage <- select(marriage, "Division", "Yes", "No", "Response_not_clear", "Non_responding", "StateTerritory")
marriage
# A tibble: 155 × 6
   Division  Yes   No    Response_not_clear Non_responding StateTerritory       
   <chr>     <chr> <chr> <chr>              <chr>          <chr>                
 1 Banks     37736 46343 247                20928          New South Wales Divi…
 2 Barton    37153 47984 226                24008          New South Wales Divi…
 3 Bennelong 42943 43215 244                19973          New South Wales Divi…
 4 Berowra   48471 40369 212                16038          New South Wales Divi…
 5 Blaxland  20406 57926 220                25883          New South Wales Divi…
 6 Bradfield 53681 34927 202                17261          New South Wales Divi…
 7 Calare    54091 35779 285                25342          New South Wales Divi…
 8 Chifley   32871 46702 263                28180          New South Wales Divi…
 9 Cook      47505 38804 229                18713          New South Wales Divi…
10 Cowper    57493 38317 315                25197          New South Wales Divi…
# … with 145 more rows

Anticipate the End Result

In the final dataframe we will have 4 rows (with answers “Yes”, “No”,“Response_not_clear”, “Non_responding”) for each division.

Example: find current and future data dimensions

Lets see if this works with a simple example.

Code
df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
           year = rep(c(1980,1990), 3), 
           trade = rep(c("NAFTA", "NAFTA", "EU"),2),
           outgoing = rnorm(6, mean=1000, sd=500),
           incoming = rlogis(6, location=1000, 
                             scale = 400))
df
# A tibble: 6 × 5
  country  year trade outgoing incoming
  <chr>   <dbl> <chr>    <dbl>    <dbl>
1 Mexico   1980 NAFTA     217.    1102.
2 USA      1990 NAFTA    -585.    1869.
3 France   1980 EU       1351.     369.
4 Mexico   1990 NAFTA     846.    1458.
5 USA      1980 NAFTA     551.    1212.
6 France   1990 EU       1932.     749.
Code
#existing rows/cases
nrow(df)
[1] 6
Code
#existing columns/cases
ncol(df)
[1] 5
Code
#expected rows/cases
nrow(df) * (ncol(df)-3)
[1] 12
Code
# expected columns 
3 + 2
[1] 5

Or simple example has \(n = 6\) rows and \(k - 3 = 2\) variables being pivoted, so we expect a new dataframe to have \(n * 2 = 12\) rows x \(3 + 2 = 5\) columns.

Challenge: Describe the final dimensions

Any additional comments?

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Example

Code
df<-pivot_longer(df, col = c(outgoing, incoming),
                 names_to="trade_direction",
                 values_to = "trade_value")
df
# A tibble: 12 × 5
   country  year trade trade_direction trade_value
   <chr>   <dbl> <chr> <chr>                 <dbl>
 1 Mexico   1980 NAFTA outgoing               217.
 2 Mexico   1980 NAFTA incoming              1102.
 3 USA      1990 NAFTA outgoing              -585.
 4 USA      1990 NAFTA incoming              1869.
 5 France   1980 EU    outgoing              1351.
 6 France   1980 EU    incoming               369.
 7 Mexico   1990 NAFTA outgoing               846.
 8 Mexico   1990 NAFTA incoming              1458.
 9 USA      1980 NAFTA outgoing               551.
10 USA      1980 NAFTA incoming              1212.
11 France   1990 EU    outgoing              1932.
12 France   1990 EU    incoming               749.

Yes, once it is pivoted long, our resulting data are \(12x5\) - exactly what we expected!

Challenge: Pivot the Chosen Data

After pivoting the case is number of certain response in a division. I think it may be easier to visualize the data using this type of table. So we may say that data is tidy because it is easier to make next steps in analyzing data.

Code
marriage_new<-pivot_longer(marriage, col = c(Yes, No, Response_not_clear, Non_responding),
                 names_to="Response_type",
                 values_to = "Number")
marriage_new
# A tibble: 620 × 4
   Division  StateTerritory            Response_type      Number
   <chr>     <chr>                     <chr>              <chr> 
 1 Banks     New South Wales Divisions Yes                37736 
 2 Banks     New South Wales Divisions No                 46343 
 3 Banks     New South Wales Divisions Response_not_clear 247   
 4 Banks     New South Wales Divisions Non_responding     20928 
 5 Barton    New South Wales Divisions Yes                37153 
 6 Barton    New South Wales Divisions No                 47984 
 7 Barton    New South Wales Divisions Response_not_clear 226   
 8 Barton    New South Wales Divisions Non_responding     24008 
 9 Bennelong New South Wales Divisions Yes                42943 
10 Bennelong New South Wales Divisions No                 43215 
# … with 610 more rows
Source Code
---
title: "Challenge 3"
author: "Mariia Dubyk"
desription: "Tidy Data: Pivoting"
date: "11/01/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  - animal_weights
  - eggs
  - australian_marriage
  - usa_households
  - sce_labor
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

```{r}
library(readxl)
marriage <- read_excel("_data/australian_marriage_law_postal_survey_2017_-_response_final.xls",
                       sheet = "Table 2",
                       col_names = c("Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Delete", "Delete", "Delete", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage"))

marriage <- slice(marriage, 8:180)
marriage <- select(marriage, "Division", "Yes", "Yes_percentage", "No", "No_percentage", "Total_clear", "Total_clear_percentage", "Response_not_clear", "Response_not_clear_percentage", "Non_responding", "Non_responding_percentage", "Total", "Total_percentage")
marriage
```

### Briefly describe the data

The dataset shows responses to the question in Australian Marriage Law Postal Survey 2017. We can see data gathered in different divisions of States and Territories in Australia. The responses are "Yes", "No", "Not clear" and "No response". The table also shows percentage of each answer and total numbers.

To tidy data I will leave only divisions in the first column. I am going to move State and Territories to another column by adding a variable (as it was shown during the class).I will also remove percentages and total numbers from the table.

```{r}
marriage <- marriage%>%
  mutate(StateTerritory = case_when(str_ends(Division, "Divisions") ~ Division), NA_character_)%>%
  fill(StateTerritory, .direction = "down")
marriage <- filter(marriage, !str_detect(Division, "Division|Australia"))
marriage <- select(marriage, "Division", "Yes", "No", "Response_not_clear", "Non_responding", "StateTerritory")
marriage
```


## Anticipate the End Result

In the final dataframe we will have 4 rows (with answers "Yes", "No","Response_not_clear", "Non_responding") for each division. 


### Example: find current and future data dimensions

Lets see if this works with a simple example.

```{r}
#| tbl-cap: Example

df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
           year = rep(c(1980,1990), 3), 
           trade = rep(c("NAFTA", "NAFTA", "EU"),2),
           outgoing = rnorm(6, mean=1000, sd=500),
           incoming = rlogis(6, location=1000, 
                             scale = 400))
df

#existing rows/cases
nrow(df)

#existing columns/cases
ncol(df)

#expected rows/cases
nrow(df) * (ncol(df)-3)

# expected columns 
3 + 2
```

Or simple example has $n = 6$ rows and $k - 3 = 2$ variables being pivoted, so we expect a new dataframe to have $n * 2 = 12$ rows x $3 + 2 = 5$ columns.

### Challenge: Describe the final dimensions


```{r}
```

Any additional comments?

## Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.

### Example

```{r}
#| tbl-cap: Pivoted Example

df<-pivot_longer(df, col = c(outgoing, incoming),
                 names_to="trade_direction",
                 values_to = "trade_value")
df
```

Yes, once it is pivoted long, our resulting data are $12x5$ - exactly what we expected!

### Challenge: Pivot the Chosen Data

After pivoting the case is number of certain response in a division. I think it may be easier to visualize the data using this type of table. So we may say that data is tidy because it is easier to make next steps in analyzing data. 

```{r}
marriage_new<-pivot_longer(marriage, col = c(Yes, No, Response_not_clear, Non_responding),
                 names_to="Response_type",
                 values_to = "Number")
marriage_new
```