Challenge 3 Solutions

Author: Vinitha Maheswaran
Published: November 24, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • animal_weights.csv ⭐
  • eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
  • australian_marriage*.xls ⭐⭐⭐
  • USA Households*.xlsx ⭐⭐⭐⭐
  • sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟

For this challenge I will be working with the “USA Households*.xlsx” data set.

Code
library(readxl)

# Reading the USA Households*.xlsx data set and storing it in a data frame

column_names <- c("Year", "Household_Number_Thousands", "Total_Percent_Distribution", "Under $15,000", "$15,000 to $24,999", "$25,000 to $34,999", "35,000 to $49,999", "$50,000 to $74,999", "$75,000 to $99,999", "$100,000 to $149,999", "$150,000 to $199,999", "$200,000 and over", "Median_Income_Estimate", "Median_Income_MOE", "Mean_Income_Estimate", "Mean_Income_MOE")
usa_data <- read_excel("_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx", col_names = column_names, skip = 5)
print(usa_data)
# A tibble: 383 × 16
   Year  House…¹ Total…² Under…³ $15,0…⁴ $25,0…⁵ 35,00…⁶ $50,0…⁷ $75,0…⁸ $100,…⁹
   <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ALL … <NA>         NA    NA      NA      NA      NA      NA      NA      NA  
 2 2019  128451      100     9.1     8       8.3    11.7    16.5    12.3    15.5
 3 2018  128579      100    10.1     8.8     8.7    12      17      12.5    15  
 4 2017… 127669      100    10       9.1     9.2    12      16.4    12.4    14.7
 5 2017  127586      100    10.1     9.1     9.2    11.9    16.3    12.6    14.8
 6 2016  126224      100    10.4     9       9.2    12.3    16.7    12.2    15  
 7 2015  125819      100    10.6    10       9.6    12.1    16.1    12.4    14.9
 8 2014  124587      100    11.4    10.5     9.6    12.6    16.4    12.1    14  
 9 2013… 123931      100    11.4    10.3     9.5    12.5    16.8    12      13.9
10 2013… 122952      100    11.3    10.4     9.7    13.1    17      12.5    13.6
# … with 373 more rows, 6 more variables: `$150,000 to $199,999` <dbl>,
#   `$200,000 and over` <dbl>, Median_Income_Estimate <dbl>,
#   Median_Income_MOE <dbl>, Mean_Income_Estimate <chr>, Mean_Income_MOE <chr>,
#   and abbreviated variable names ¹​Household_Number_Thousands,
#   ²​Total_Percent_Distribution, ³​`Under $15,000`, ⁴​`$15,000 to $24,999`,
#   ⁵​`$25,000 to $34,999`, ⁶​`35,000 to $49,999`, ⁷​`$50,000 to $74,999`,
#   ⁸​`$75,000 to $99,999`, ⁹​`$100,000 to $149,999`

Since the “USA Households*.xlsx” data set is in Excel format, I use the ‘readxl’ package to read it. After reading, the data is stored in the dataframe “usa_data”. The first three rows of the file contain a description of the data, and the 4th and 5th rows contain the column headings. I resolve this by skipping the first 5 rows while reading the data set and supplying the renamed column names myself.
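
To see what those skipped rows contain, the top of the sheet can be previewed without headers. A minimal sketch, assuming the same file path as above (this check is not part of the original output):

Code
# Illustrative preview: read the first six rows with no column names to confirm
# that rows 1-3 hold titles/notes and rows 4-5 hold the two-level header that
# skip = 5 discards.
read_excel("_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx",
           col_names = FALSE, n_max = 6)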

Data Cleaning

Code
# Removing the last 31 rows as they are just footnotes and not observations
usa_data <- head(usa_data,-31)
print(usa_data)
# A tibble: 352 × 16
   Year  House…¹ Total…² Under…³ $15,0…⁴ $25,0…⁵ 35,00…⁶ $50,0…⁷ $75,0…⁸ $100,…⁹
   <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ALL … <NA>         NA    NA      NA      NA      NA      NA      NA      NA  
 2 2019  128451      100     9.1     8       8.3    11.7    16.5    12.3    15.5
 3 2018  128579      100    10.1     8.8     8.7    12      17      12.5    15  
 4 2017… 127669      100    10       9.1     9.2    12      16.4    12.4    14.7
 5 2017  127586      100    10.1     9.1     9.2    11.9    16.3    12.6    14.8
 6 2016  126224      100    10.4     9       9.2    12.3    16.7    12.2    15  
 7 2015  125819      100    10.6    10       9.6    12.1    16.1    12.4    14.9
 8 2014  124587      100    11.4    10.5     9.6    12.6    16.4    12.1    14  
 9 2013… 123931      100    11.4    10.3     9.5    12.5    16.8    12      13.9
10 2013… 122952      100    11.3    10.4     9.7    13.1    17      12.5    13.6
# … with 342 more rows, 6 more variables: `$150,000 to $199,999` <dbl>,
#   `$200,000 and over` <dbl>, Median_Income_Estimate <dbl>,
#   Median_Income_MOE <dbl>, Mean_Income_Estimate <chr>, Mean_Income_MOE <chr>,
#   and abbreviated variable names ¹​Household_Number_Thousands,
#   ²​Total_Percent_Distribution, ³​`Under $15,000`, ⁴​`$15,000 to $24,999`,
#   ⁵​`$25,000 to $34,999`, ⁶​`35,000 to $49,999`, ⁷​`$50,000 to $74,999`,
#   ⁸​`$75,000 to $99,999`, ⁹​`$100,000 to $149,999`
Code
# Dimensions of usa_data
dim(usa_data)
[1] 352  16
Code
# Creating a new column for Race and filling empty values with the previous race value in that column

usa_data <- usa_data%>%
  mutate(Race = case_when(str_detect(Year,("([A-Z])")) ~ Year))%>%
  fill(Race, .direction = "down")

# Removing rows from usa_data that have non-numeric values in the Year column (these rows carry the Race label in the Year column and were inserted as separators between race groups)

usa_data <- usa_data%>%
  filter(!str_detect(Year,("([A-Z])")))

print(usa_data)
# A tibble: 340 × 17
   Year  House…¹ Total…² Under…³ $15,0…⁴ $25,0…⁵ 35,00…⁶ $50,0…⁷ $75,0…⁸ $100,…⁹
   <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 2019  128451      100     9.1     8       8.3    11.7    16.5    12.3    15.5
 2 2018  128579      100    10.1     8.8     8.7    12      17      12.5    15  
 3 2017… 127669      100    10       9.1     9.2    12      16.4    12.4    14.7
 4 2017  127586      100    10.1     9.1     9.2    11.9    16.3    12.6    14.8
 5 2016  126224      100    10.4     9       9.2    12.3    16.7    12.2    15  
 6 2015  125819      100    10.6    10       9.6    12.1    16.1    12.4    14.9
 7 2014  124587      100    11.4    10.5     9.6    12.6    16.4    12.1    14  
 8 2013… 123931      100    11.4    10.3     9.5    12.5    16.8    12      13.9
 9 2013… 122952      100    11.3    10.4     9.7    13.1    17      12.5    13.6
10 2012  122459      100    11.4    10.6    10.1    12.5    17.4    12      13.9
# … with 330 more rows, 7 more variables: `$150,000 to $199,999` <dbl>,
#   `$200,000 and over` <dbl>, Median_Income_Estimate <dbl>,
#   Median_Income_MOE <dbl>, Mean_Income_Estimate <chr>, Mean_Income_MOE <chr>,
#   Race <chr>, and abbreviated variable names ¹​Household_Number_Thousands,
#   ²​Total_Percent_Distribution, ³​`Under $15,000`, ⁴​`$15,000 to $24,999`,
#   ⁵​`$25,000 to $34,999`, ⁶​`35,000 to $49,999`, ⁷​`$50,000 to $74,999`,
#   ⁸​`$75,000 to $99,999`, ⁹​`$100,000 to $149,999`
Code
# Removing the footnote number next to the year value in the "Year" column

usa_data <- usa_data%>%
  mutate(Year = str_remove(Year, " .*"))

usa_data
# A tibble: 340 × 17
   Year  House…¹ Total…² Under…³ $15,0…⁴ $25,0…⁵ 35,00…⁶ $50,0…⁷ $75,0…⁸ $100,…⁹
   <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 2019  128451      100     9.1     8       8.3    11.7    16.5    12.3    15.5
 2 2018  128579      100    10.1     8.8     8.7    12      17      12.5    15  
 3 2017  127669      100    10       9.1     9.2    12      16.4    12.4    14.7
 4 2017  127586      100    10.1     9.1     9.2    11.9    16.3    12.6    14.8
 5 2016  126224      100    10.4     9       9.2    12.3    16.7    12.2    15  
 6 2015  125819      100    10.6    10       9.6    12.1    16.1    12.4    14.9
 7 2014  124587      100    11.4    10.5     9.6    12.6    16.4    12.1    14  
 8 2013  123931      100    11.4    10.3     9.5    12.5    16.8    12      13.9
 9 2013  122952      100    11.3    10.4     9.7    13.1    17      12.5    13.6
10 2012  122459      100    11.4    10.6    10.1    12.5    17.4    12      13.9
# … with 330 more rows, 7 more variables: `$150,000 to $199,999` <dbl>,
#   `$200,000 and over` <dbl>, Median_Income_Estimate <dbl>,
#   Median_Income_MOE <dbl>, Mean_Income_Estimate <chr>, Mean_Income_MOE <chr>,
#   Race <chr>, and abbreviated variable names ¹​Household_Number_Thousands,
#   ²​Total_Percent_Distribution, ³​`Under $15,000`, ⁴​`$15,000 to $24,999`,
#   ⁵​`$25,000 to $34,999`, ⁶​`35,000 to $49,999`, ⁷​`$50,000 to $74,999`,
#   ⁸​`$75,000 to $99,999`, ⁹​`$100,000 to $149,999`
Code
# Removing the footnote number next to the race value in the "Race" column

usa_data <- usa_data%>%
  mutate(Race = str_remove(Race, "[0-9]+"))

usa_data
# A tibble: 340 × 17
   Year  House…¹ Total…² Under…³ $15,0…⁴ $25,0…⁵ 35,00…⁶ $50,0…⁷ $75,0…⁸ $100,…⁹
   <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 2019  128451      100     9.1     8       8.3    11.7    16.5    12.3    15.5
 2 2018  128579      100    10.1     8.8     8.7    12      17      12.5    15  
 3 2017  127669      100    10       9.1     9.2    12      16.4    12.4    14.7
 4 2017  127586      100    10.1     9.1     9.2    11.9    16.3    12.6    14.8
 5 2016  126224      100    10.4     9       9.2    12.3    16.7    12.2    15  
 6 2015  125819      100    10.6    10       9.6    12.1    16.1    12.4    14.9
 7 2014  124587      100    11.4    10.5     9.6    12.6    16.4    12.1    14  
 8 2013  123931      100    11.4    10.3     9.5    12.5    16.8    12      13.9
 9 2013  122952      100    11.3    10.4     9.7    13.1    17      12.5    13.6
10 2012  122459      100    11.4    10.6    10.1    12.5    17.4    12      13.9
# … with 330 more rows, 7 more variables: `$150,000 to $199,999` <dbl>,
#   `$200,000 and over` <dbl>, Median_Income_Estimate <dbl>,
#   Median_Income_MOE <dbl>, Mean_Income_Estimate <chr>, Mean_Income_MOE <chr>,
#   Race <chr>, and abbreviated variable names ¹​Household_Number_Thousands,
#   ²​Total_Percent_Distribution, ³​`Under $15,000`, ⁴​`$15,000 to $24,999`,
#   ⁵​`$25,000 to $34,999`, ⁶​`35,000 to $49,999`, ⁷​`$50,000 to $74,999`,
#   ⁸​`$75,000 to $99,999`, ⁹​`$100,000 to $149,999`
Code
# Remove the "Total_Percent_Distribution" column as that value is 100 for all observations

usa_data <- usa_data%>%
  subset(select = -c(3))

usa_data
# A tibble: 340 × 16
   Year  House…¹ Under…² $15,0…³ $25,0…⁴ 35,00…⁵ $50,0…⁶ $75,0…⁷ $100,…⁸ $150,…⁹
   <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 2019  128451      9.1     8       8.3    11.7    16.5    12.3    15.5     8.3
 2 2018  128579     10.1     8.8     8.7    12      17      12.5    15       7.2
 3 2017  127669     10       9.1     9.2    12      16.4    12.4    14.7     7.3
 4 2017  127586     10.1     9.1     9.2    11.9    16.3    12.6    14.8     7.5
 5 2016  126224     10.4     9       9.2    12.3    16.7    12.2    15       7.2
 6 2015  125819     10.6    10       9.6    12.1    16.1    12.4    14.9     7.1
 7 2014  124587     11.4    10.5     9.6    12.6    16.4    12.1    14       6.6
 8 2013  123931     11.4    10.3     9.5    12.5    16.8    12      13.9     6.7
 9 2013  122952     11.3    10.4     9.7    13.1    17      12.5    13.6     6.3
10 2012  122459     11.4    10.6    10.1    12.5    17.4    12      13.9     6.3
# … with 330 more rows, 6 more variables: `$200,000 and over` <dbl>,
#   Median_Income_Estimate <dbl>, Median_Income_MOE <dbl>,
#   Mean_Income_Estimate <chr>, Mean_Income_MOE <chr>, Race <chr>, and
#   abbreviated variable names ¹​Household_Number_Thousands, ²​`Under $15,000`,
#   ³​`$15,000 to $24,999`, ⁴​`$25,000 to $34,999`, ⁵​`35,000 to $49,999`,
#   ⁶​`$50,000 to $74,999`, ⁷​`$75,000 to $99,999`, ⁸​`$100,000 to $149,999`,
#   ⁹​`$150,000 to $199,999`
Code
# Reorder the columns in the usa_data dataframe so that "Race" is the first column followed by "Year"

usa_data <- usa_data%>%
  subset(select=c(16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15))

usa_data
# A tibble: 340 × 16
   Race    Year  House…¹ Under…² $15,0…³ $25,0…⁴ 35,00…⁵ $50,0…⁶ $75,0…⁷ $100,…⁸
   <chr>   <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ALL RA… 2019  128451      9.1     8       8.3    11.7    16.5    12.3    15.5
 2 ALL RA… 2018  128579     10.1     8.8     8.7    12      17      12.5    15  
 3 ALL RA… 2017  127669     10       9.1     9.2    12      16.4    12.4    14.7
 4 ALL RA… 2017  127586     10.1     9.1     9.2    11.9    16.3    12.6    14.8
 5 ALL RA… 2016  126224     10.4     9       9.2    12.3    16.7    12.2    15  
 6 ALL RA… 2015  125819     10.6    10       9.6    12.1    16.1    12.4    14.9
 7 ALL RA… 2014  124587     11.4    10.5     9.6    12.6    16.4    12.1    14  
 8 ALL RA… 2013  123931     11.4    10.3     9.5    12.5    16.8    12      13.9
 9 ALL RA… 2013  122952     11.3    10.4     9.7    13.1    17      12.5    13.6
10 ALL RA… 2012  122459     11.4    10.6    10.1    12.5    17.4    12      13.9
# … with 330 more rows, 6 more variables: `$150,000 to $199,999` <dbl>,
#   `$200,000 and over` <dbl>, Median_Income_Estimate <dbl>,
#   Median_Income_MOE <dbl>, Mean_Income_Estimate <chr>, Mean_Income_MOE <chr>,
#   and abbreviated variable names ¹​Household_Number_Thousands,
#   ²​`Under $15,000`, ³​`$15,000 to $24,999`, ⁴​`$25,000 to $34,999`,
#   ⁵​`35,000 to $49,999`, ⁶​`$50,000 to $74,999`, ⁷​`$75,000 to $99,999`,
#   ⁸​`$100,000 to $149,999`
Code
# Find number of distinct Year values
n_distinct(usa_data$Year)
[1] 53
Code
# Sanity check: Check that the unique Year values range from 1967 - 2019 after data cleaning 
unique(usa_data$Year)
 [1] "2019" "2018" "2017" "2016" "2015" "2014" "2013" "2012" "2011" "2010"
[11] "2009" "2008" "2007" "2006" "2005" "2004" "2003" "2002" "2001" "2000"
[21] "1999" "1998" "1997" "1996" "1995" "1994" "1993" "1992" "1991" "1990"
[31] "1989" "1988" "1987" "1986" "1985" "1984" "1983" "1982" "1981" "1980"
[41] "1979" "1978" "1977" "1976" "1975" "1974" "1973" "1972" "1971" "1970"
[51] "1969" "1968" "1967"
Code
# Find number of distinct Race values
n_distinct(usa_data$Race)
[1] 12
Code
# Sanity check: Check that the unique Race values do not have any footnote numbers after data cleaning 
unique(usa_data$Race)
 [1] "ALL RACES"                     "WHITE ALONE "                 
 [3] "WHITE "                        "WHITE ALONE, NOT HISPANIC "   
 [5] "WHITE, NOT HISPANIC "          "BLACK ALONE OR IN COMBINATION"
 [7] "BLACK ALONE "                  "BLACK "                       
 [9] "ASIAN ALONE OR IN COMBINATION" "ASIAN ALONE "                 
[11] "ASIAN AND PACIFIC ISLANDER "   "HISPANIC (ANY RACE) "         
Code
#Summary of usa_data
library(summarytools)
print(summarytools::dfSummary(usa_data,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.60, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary: usa_data
Dimensions: 340 x 16
Duplicates: 0
Missing values: 0 (0.0%) for every variable

Race [character]: 12 distinct values
  ALL RACES 55 (16.2%); HISPANIC (ANY RACE) 50 (14.7%); BLACK 35 (10.3%); WHITE 35 (10.3%); WHITE, NOT HISPANIC 30 (8.8%); ASIAN ALONE, ASIAN ALONE OR IN COMBINATION, BLACK ALONE, BLACK ALONE OR IN COMBINATION, and WHITE ALONE 20 each (5.9%); 2 others 35 (10.3%)

Year [character]: 53 distinct values
  2013 and 2017 16 each (4.7%); 2002 through 2009 8 each (2.4%); 43 others 244 (71.8%)

Household_Number_Thousands [character]: 339 distinct values
  6750 occurs twice (0.6%); every other value occurs once

Under $15,000 [numeric]
  Mean (sd): 13.1 (5.2); min ≤ med ≤ max: 6.3 ≤ 10.9 ≤ 27.2; IQR (CV): 7.2 (0.4); 142 distinct values

$15,000 to $24,999 [numeric]
  Mean (sd): 10.7 (2.7); min ≤ med ≤ max: 5 ≤ 10.1 ≤ 17.7; IQR (CV): 3.8 (0.3); 104 distinct values

$25,000 to $34,999 [numeric]
  Mean (sd): 10.1 (2.1); min ≤ med ≤ max: 5 ≤ 9.9 ≤ 15.5; IQR (CV): 2.7 (0.2); 87 distinct values

35,000 to $49,999 [numeric]
  Mean (sd): 13.5 (2.2); min ≤ med ≤ max: 7.7 ≤ 13.8 ≤ 20.6; IQR (CV): 2.3 (0.2); 93 distinct values

$50,000 to $74,999 [numeric]
  Mean (sd): 18 (2.4); min ≤ med ≤ max: 12.9 ≤ 17.5 ≤ 25.8; IQR (CV): 2.9 (0.1); 92 distinct values

$75,000 to $99,999 [numeric]
  Mean (sd): 12.2 (2.1); min ≤ med ≤ max: 5.5 ≤ 12.6 ≤ 16.3; IQR (CV): 3.4 (0.2); 86 distinct values

$100,000 to $149,999 [numeric]
  Mean (sd): 12.6 (4); min ≤ med ≤ max: 2.7 ≤ 13.1 ≤ 19.9; IQR (CV): 6.4 (0.3); 130 distinct values

$150,000 to $199,999 [numeric]
  Mean (sd): 5.1 (2.9); min ≤ med ≤ max: 0.4 ≤ 4.3 ≤ 12.5; IQR (CV): 4.1 (0.6); 104 distinct values

$200,000 and over [numeric]
  Mean (sd): 4.7 (3.8); min ≤ med ≤ max: 0.1 ≤ 3.5 ≤ 18.9; IQR (CV): 4.9 (0.8); 115 distinct values

Median_Income_Estimate [numeric]
  Mean (sd): 55534.5 (14465.9); min ≤ med ≤ max: 29026 ≤ 56101.5 ≤ 98174; IQR (CV): 22391.5 (0.3); 339 distinct values

Median_Income_MOE [numeric]
  Mean (sd): 1143.7 (996.3); min ≤ med ≤ max: 268 ≤ 813 ≤ 6080; IQR (CV): 839.5 (0.9); 308 distinct values

Mean_Income_Estimate [character]: 340 distinct values, each occurring once (column read as text)

Mean_Income_MOE [character]
  "N" occurs 3 times (0.9%); 1033, 1167, 1240, 1333, 1371, 3050, 306, 310, and 316 occur twice each (0.6%); 301 other values account for the remaining 319 (93.8%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-21

After reading the data and renaming the columns, the dataframe has 383 rows and 16 columns. The last 31 of those rows are footnotes, so removing them leaves 352 rows and 16 columns.

The first column, “Year”, mixes year values with the Race and Hispanic origin of the householder: each race label sits on its own row and acts as a separator for the race group that follows. I therefore created a new column called “Race”, filled it with the race label where one was present, and carried the previous label forward otherwise, giving 12 distinct race groups. I then removed the separator rows, i.e. the rows whose first column holds a non-numeric value, leaving a dataframe of 340 rows and 17 columns with separate “Year” and “Race” columns.

Next, I stripped the footnote numbers that followed values in the “Year” and “Race” columns. I also dropped the “Total_Percent_Distribution” column, since it equals 100 for every observation and carries no information: the percentage values of the 9 income levels in each row should themselves add up to 100. Finally, I reordered the columns so that “Race” comes first, followed by “Year” and the remaining columns.

After cleaning, I am left with a dataframe of 340 observations and 16 columns/attributes. As a sanity check, I confirmed that the data covers the period 1967 - 2019 and contains 12 race categories in total. I have not removed the observations with the race value “ALL RACES”, as I am not sure whether they subsume the other race categories. Summarizing the data with the dfSummary() function confirms that there are no duplicates or missing values in the data.
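
The post asserts, but does not check, that the nine income-level percentages in each row add up to 100. A minimal sketch of such a check, assuming the cleaned usa_data above with the income levels in columns 4 through 12:

Code
# Sketch (not part of the original analysis): the nine income-level shares in
# each row should sum to roughly 100, up to rounding.
usa_data %>%
  mutate(total_share = rowSums(across(4:12))) %>%
  summarise(min_total = min(total_share), max_total = max(total_share))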

Briefly describe the data

Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”

The USA household data contains information on the mean and median income, grouped by Race and Hispanic Origin of the householder, for the period 1967 - 2019. The data is split into 12 categories based on race. For each category we have the total number of households surveyed in a given year, the total percentage distribution (100 for all observations), the percentage distribution across the various income levels, and the mean and median income (estimate and margin of error). For some races, data is not available for all the years in the period 1967 - 2019.
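
That coverage gap is easy to quantify by counting the distinct years per race group; a quick sketch (not shown in the original):

Code
# Illustrative: number of distinct years of data per race group. Groups that
# were introduced in later surveys cover fewer years.
usa_data %>%
  group_by(Race) %>%
  summarise(n_years = n_distinct(Year)) %>%
  arrange(n_years)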

Currently, each row combines nine income-level measurements spread across separate columns, which makes data manipulation operations like filtering and grouping difficult. Hence, we need to pivot the data so that we can group it and calculate aggregates/statistics for the 12 race categories across the years 1967 - 2019. Pivoting turns the percentage of each income level for a given race and year into an individual observation, which also makes it easier to plot graphs and visualize trends.

Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.

Suppose you have a dataset with n rows and k variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting k - 3 variables into a longer format where the k - 3 variable names will move into the names_to variable and the current values in each of those columns will move into the values_to variable. Therefore, we would expect n * (k - 3) rows in the pivoted dataframe!

Challenge: Describe the final dimensions

Document your work here.

Code
# Printing the dataframe
usa_data
# A tibble: 340 × 16
   Race    Year  House…¹ Under…² $15,0…³ $25,0…⁴ 35,00…⁵ $50,0…⁶ $75,0…⁷ $100,…⁸
   <chr>   <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ALL RA… 2019  128451      9.1     8       8.3    11.7    16.5    12.3    15.5
 2 ALL RA… 2018  128579     10.1     8.8     8.7    12      17      12.5    15  
 3 ALL RA… 2017  127669     10       9.1     9.2    12      16.4    12.4    14.7
 4 ALL RA… 2017  127586     10.1     9.1     9.2    11.9    16.3    12.6    14.8
 5 ALL RA… 2016  126224     10.4     9       9.2    12.3    16.7    12.2    15  
 6 ALL RA… 2015  125819     10.6    10       9.6    12.1    16.1    12.4    14.9
 7 ALL RA… 2014  124587     11.4    10.5     9.6    12.6    16.4    12.1    14  
 8 ALL RA… 2013  123931     11.4    10.3     9.5    12.5    16.8    12      13.9
 9 ALL RA… 2013  122952     11.3    10.4     9.7    13.1    17      12.5    13.6
10 ALL RA… 2012  122459     11.4    10.6    10.1    12.5    17.4    12      13.9
# … with 330 more rows, 6 more variables: `$150,000 to $199,999` <dbl>,
#   `$200,000 and over` <dbl>, Median_Income_Estimate <dbl>,
#   Median_Income_MOE <dbl>, Mean_Income_Estimate <chr>, Mean_Income_MOE <chr>,
#   and abbreviated variable names ¹​Household_Number_Thousands,
#   ²​`Under $15,000`, ³​`$15,000 to $24,999`, ⁴​`$25,000 to $34,999`,
#   ⁵​`35,000 to $49,999`, ⁶​`$50,000 to $74,999`, ⁷​`$75,000 to $99,999`,
#   ⁸​`$100,000 to $149,999`
Code
# Removing the columns containing mean and median income

usa_data_final <- usa_data%>%
  subset(select=(-c(13,14,15,16)))

usa_data_final
# A tibble: 340 × 12
   Race    Year  House…¹ Under…² $15,0…³ $25,0…⁴ 35,00…⁵ $50,0…⁶ $75,0…⁷ $100,…⁸
   <chr>   <chr> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ALL RA… 2019  128451      9.1     8       8.3    11.7    16.5    12.3    15.5
 2 ALL RA… 2018  128579     10.1     8.8     8.7    12      17      12.5    15  
 3 ALL RA… 2017  127669     10       9.1     9.2    12      16.4    12.4    14.7
 4 ALL RA… 2017  127586     10.1     9.1     9.2    11.9    16.3    12.6    14.8
 5 ALL RA… 2016  126224     10.4     9       9.2    12.3    16.7    12.2    15  
 6 ALL RA… 2015  125819     10.6    10       9.6    12.1    16.1    12.4    14.9
 7 ALL RA… 2014  124587     11.4    10.5     9.6    12.6    16.4    12.1    14  
 8 ALL RA… 2013  123931     11.4    10.3     9.5    12.5    16.8    12      13.9
 9 ALL RA… 2013  122952     11.3    10.4     9.7    13.1    17      12.5    13.6
10 ALL RA… 2012  122459     11.4    10.6    10.1    12.5    17.4    12      13.9
# … with 330 more rows, 2 more variables: `$150,000 to $199,999` <dbl>,
#   `$200,000 and over` <dbl>, and abbreviated variable names
#   ¹​Household_Number_Thousands, ²​`Under $15,000`, ³​`$15,000 to $24,999`,
#   ⁴​`$25,000 to $34,999`, ⁵​`35,000 to $49,999`, ⁶​`$50,000 to $74,999`,
#   ⁷​`$75,000 to $99,999`, ⁸​`$100,000 to $149,999`
Code
#existing rows/cases
nrow(usa_data_final)
[1] 340
Code
#existing columns/variables
ncol(usa_data_final)
[1] 12
Code
#expected rows/cases
nrow(usa_data_final) * (ncol(usa_data_final)-3)
[1] 3060
Code
# expected columns 
3 + 2
[1] 5

Any additional comments?

I plan to pivot the data using the pivot_longer() function, which increases the number of rows and decreases the number of columns, so the resulting long-format table will have far more rows than columns. This will make it easier to compare the different income levels for a given race and year.

Currently, the dataframe contains 340 rows and 16 columns. If we pivoted all of it, the mean and median income values would be repeated for every income level of an observation, which is redundant and makes comparison harder. Hence, I subset the data and remove the columns containing mean and median income; those attributes can be pivoted into a separate long table in a later step. After subsetting, we are left with 340 rows and 12 columns. The pivot will gather the 9 income levels into a single column called “Income Level”, with the corresponding percentage values stored in a column called “Percent Value”, while keeping the “Race”, “Year”, and “Household_Number_Thousands” columns as identifiers. Therefore, after pivoting we should expect 9 * 340 = 3060 rows and 3 + 2 = 5 columns: “Race”, “Year”, “Household_Number_Thousands”, “Income Level”, and “Percent Value”.
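
As a sketch of that later step (not run in the original post), the mean and median income columns set aside above could be gathered into their own long table. The coercion to numeric is an assumption to handle the “N” entries (presumably “not available”) that forced the mean-income columns to be read as text:

Code
# Sketch (illustrative): pivot the mean/median income statistics into a separate
# long table; "N" entries become NA when coerced to numeric.
income_stats <- usa_data %>%
  select(Race, Year, Median_Income_Estimate, Median_Income_MOE,
         Mean_Income_Estimate, Mean_Income_MOE) %>%
  mutate(across(-c(Race, Year), ~ suppressWarnings(as.numeric(.x)))) %>%
  pivot_longer(cols = -c(Race, Year),
               names_to = "Statistic", values_to = "Value")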

Pivot the Data

Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a “sanity” check.

Challenge: Pivot the Chosen Data

Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?

Code
# Pivoting the data using pivot_longer()
df <- pivot_longer(usa_data_final, cols = c(4:12),
                   names_to = "Income Level",
                   values_to = "Percent Value")
df
# A tibble: 3,060 × 5
   Race      Year  Household_Number_Thousands `Income Level`       Percent Val…¹
   <chr>     <chr> <chr>                      <chr>                        <dbl>
 1 ALL RACES 2019  128451                     Under $15,000                  9.1
 2 ALL RACES 2019  128451                     $15,000 to $24,999             8  
 3 ALL RACES 2019  128451                     $25,000 to $34,999             8.3
 4 ALL RACES 2019  128451                     35,000 to $49,999             11.7
 5 ALL RACES 2019  128451                     $50,000 to $74,999            16.5
 6 ALL RACES 2019  128451                     $75,000 to $99,999            12.3
 7 ALL RACES 2019  128451                     $100,000 to $149,999          15.5
 8 ALL RACES 2019  128451                     $150,000 to $199,999           8.3
 9 ALL RACES 2019  128451                     $200,000 and over             10.3
10 ALL RACES 2018  128579                     Under $15,000                 10.1
# … with 3,050 more rows, and abbreviated variable name ¹​`Percent Value`
Code
# Checking dimension of the pivot table
dim(df)
[1] 3060    5

Any additional comments?

The pivoted table has 3060 rows/observations and 5 columns, which matches the expected row and column counts we predicted in the previous section. In the pivoted table, we can observe the household count and the percentage value for each of the 9 income levels for every Race and Year.
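
To illustrate the kind of comparison this long format enables, here is a small example (not part of the original solution; the income-level label matches the column names used earlier):

Code
# Illustrative: with one row per race/year/income level, grouped summaries are
# one-liners, e.g. the average share of households earning $200,000 and over
# within each race group across all available years.
df %>%
  filter(`Income Level` == "$200,000 and over") %>%
  group_by(Race) %>%
  summarise(avg_percent = mean(`Percent Value`, na.rm = TRUE)) %>%
  arrange(desc(avg_percent))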
