DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 3

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in data
    • Briefly describe the data
    • Challenge: Describe the final dimensions
  • Pivot the Data
    • Pivot_Wider

Challenge 3

  • Show All Code
  • Hide All Code

  • View Source
challenge_3
animal_weights
eggs
australian_marriage
usa_households
sce_labor
Author

Lai Wei

Published

November 8, 2022

Code
library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • animal_weights.csv ⭐
  • eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
  • australian_marriage*.xls ⭐⭐⭐
  • USA Households*.xlsx ⭐⭐⭐⭐
  • sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟
Code
Households <- read_excel("_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx",
                         skip = 5,
                         col_names = c("Year","Household","Total",str_c("income",1:9,sep = "_i"),"Median_income","MarginOfError","Mean_income","MarginOfError"))
Households
# A tibble: 383 × 16
   Year    House…¹ Total incom…² incom…³ incom…⁴ incom…⁵ incom…⁶ incom…⁷ incom…⁸
   <chr>   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ALL RA… <NA>       NA    NA      NA      NA      NA      NA      NA      NA  
 2 2019    128451    100     9.1     8       8.3    11.7    16.5    12.3    15.5
 3 2018    128579    100    10.1     8.8     8.7    12      17      12.5    15  
 4 2017 2  127669    100    10       9.1     9.2    12      16.4    12.4    14.7
 5 2017    127586    100    10.1     9.1     9.2    11.9    16.3    12.6    14.8
 6 2016    126224    100    10.4     9       9.2    12.3    16.7    12.2    15  
 7 2015    125819    100    10.6    10       9.6    12.1    16.1    12.4    14.9
 8 2014    124587    100    11.4    10.5     9.6    12.6    16.4    12.1    14  
 9 2013 3  123931    100    11.4    10.3     9.5    12.5    16.8    12      13.9
10 2013 4  122952    100    11.3    10.4     9.7    13.1    17      12.5    13.6
# … with 373 more rows, 6 more variables: income_i8 <dbl>, income_i9 <dbl>,
#   Median_income <dbl>, MarginOfError...14 <dbl>, Mean_income <chr>,
#   MarginOfError...16 <chr>, and abbreviated variable names ¹​Household,
#   ²​income_i1, ³​income_i2, ⁴​income_i3, ⁵​income_i4, ⁶​income_i5, ⁷​income_i6,
#   ⁸​income_i7

Read the data and rename the columns.

Briefly describe the data

Existing the number of rows

Code
nrow(Households)
[1] 383

There are 383 statistics in different years in total.

Challenge: Describe the final dimensions

Code
Households %>% 
  filter(Year%in%c(2007,2017)) 
# A tibble: 16 × 16
   Year  Household Total incom…¹ incom…² incom…³ incom…⁴ incom…⁵ incom…⁶ incom…⁷
   <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 2017  127586      100    10.1     9.1     9.2    11.9    16.3    12.6    14.8
 2 2007  116783      100    10      10       9      12.8    17.3    12.6    15.3
 3 2017  100065      100     8.6     8.7     8.9    11.7    16.5    12.9    15.7
 4 2007  95112       100     8.4     9.6     8.9    12.6    17.5    13      16.1
 5 2017  84681       100     8       8.5     8.5    11.2    16.1    13.1    16.3
 6 2007  82765       100     7.9     9.2     8.5    12      17.3    13.2    16.9
 7 2017  17801       100    18.9    12.1    11.8    13.8    15.7    10.6     9.9
 8 2007  14976       100    19.5    13.1    10.5    14.7    16.6     9.8    10.2
 9 2017  16997       100    19.1    12.2    11.8    13.9    15.7    10.3     9.8
10 2007  14551       100    19.6    13.1    10.5    14.7    16.7     9.8    10  
11 2017  7114        100     8.6     6.3     6       9.3    14.4    12.6    16.3
12 2007  4715        100     8.6     6.6     6.6     9.7    15.4    12.8    19.1
13 2017  6735        100     8.7     6.4     5.7     9.1    14.4    12.8    16.3
14 2007  4494        100     8.6     6.7     6.5     9.5    15.5    12.5    19.5
15 2017  17318       100    11.6    10.3    11.4    14.3    18.5    12.5    12.2
16 2007  13339       100    12.5    12.3    11.5    16.6    18.6    11.9    10.5
# … with 6 more variables: income_i8 <dbl>, income_i9 <dbl>,
#   Median_income <dbl>, MarginOfError...14 <dbl>, Mean_income <chr>,
#   MarginOfError...16 <chr>, and abbreviated variable names ¹​income_i1,
#   ²​income_i2, ³​income_i3, ⁴​income_i4, ⁵​income_i5, ⁶​income_i6, ⁷​income_i7

Use the filter() to get the information in 2007 and 2017.

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Pivot_Wider

Code
Households %>% 
  filter(Year%in%2018) %>% 
  select(Household,Year,Median_income) %>% 
  pivot_wider(values_from = Median_income ,names_from = Year)
# A tibble: 8 × 2
  Household `2018`
  <chr>      <dbl>
1 128579     64324
2 100528     68156
3 84727      71922
4 18095      42447
5 17167      42110
6 7416       88388
7 6981       88774
8 17758      52382

By using pivot_wider() function, we can get a new table that showing in year 2018, how much each household have in the median range.

Source Code
---
title: "Challenge 3"
author: "Lai Wei"
desription: "Tidy Data: Pivoting"
date: "11/08/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_3
  - animal_weights
  - eggs
  - australian_marriage
  - usa_households
  - sce_labor
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to:

1.  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2.  identify what needs to be done to tidy the current data
3.  anticipate the shape of pivoted data
4.  pivot the data into tidy format using `pivot_longer`

## Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

-   animal_weights.csv ⭐
-   eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
-   australian_marriage\*.xls ⭐⭐⭐
-   USA Households\*.xlsx ⭐⭐⭐⭐
-   sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟

```{r}
Households <- read_excel("_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx",
                         skip = 5,
                         col_names = c("Year","Household","Total",str_c("income",1:9,sep = "_i"),"Median_income","MarginOfError","Mean_income","MarginOfError"))
Households
```
Read the data and rename the columns. 

### Briefly describe the data
Existing the number of rows
```{r}
nrow(Households)
```
There are 383 statistics in different years in total. 

### Challenge: Describe the final dimensions

```{r}
Households %>% 
  filter(Year%in%c(2007,2017)) 
```
Use the filter() to get the information in 2007 and 2017.

## Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.

### Pivot_Wider

```{r}
Households %>% 
  filter(Year%in%2018) %>% 
  select(Household,Year,Median_income) %>% 
  pivot_wider(values_from = Median_income ,names_from = Year)
```
By using pivot_wider() function, we can get a new table that showing in year 2018, how much each household have in the median range.