challenge_3
Author

Quinn He

Published

August 19, 2022

Code
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Note

Go to the “Eggs” label if you don’t want to see me fumble through a harder data set.

Read in data

Code
household <-read_excel("_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx", 
                skip = 4)

household
# A tibble: 383 × 16
   ...1      ...2  Total Under…¹ $15,0…² $25,0…³ $35,0…⁴ $50,0…⁵ $75,0…⁶ $100,…⁷
   <chr>     <chr> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ALL RACES <NA>     NA    NA      NA      NA      NA      NA      NA      NA  
 2 2019      1284…   100     9.1     8       8.3    11.7    16.5    12.3    15.5
 3 2018      1285…   100    10.1     8.8     8.7    12      17      12.5    15  
 4 2017 2    1276…   100    10       9.1     9.2    12      16.4    12.4    14.7
 5 2017      1275…   100    10.1     9.1     9.2    11.9    16.3    12.6    14.8
 6 2016      1262…   100    10.4     9       9.2    12.3    16.7    12.2    15  
 7 2015      1258…   100    10.6    10       9.6    12.1    16.1    12.4    14.9
 8 2014      1245…   100    11.4    10.5     9.6    12.6    16.4    12.1    14  
 9 2013 3    1239…   100    11.4    10.3     9.5    12.5    16.8    12      13.9
10 2013 4    1229…   100    11.3    10.4     9.7    13.1    17      12.5    13.6
# … with 373 more rows, 6 more variables: `$150,000\r\nto\r\n$199,999` <dbl>,
#   `$200,000 and over` <dbl>, Estimate...13 <dbl>,
#   `Margin of error1 (±)...14` <dbl>, Estimate...15 <chr>,
#   `Margin of error1 (±)...16` <chr>, and abbreviated variable names
#   ¹​`Under $15,000`, ²​`$15,000\r\nto\r\n$24,999`, ³​`$25,000\r\nto\r\n$34,999`,
#   ⁴​`$35,000\r\nto\r\n$49,999`, ⁵​`$50,000\r\nto\r\n$74,999`,
#   ⁶​`$75,000\r\nto\r\n$99,999`, ⁷​`$100,000\r\nto\r\n$149,999`
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Briefly describe the data

This is a data set that contains household income by race for homeowners from 1967 to 2019. The data set is a mess with the first column containing the years listed in descending order followed by the next race in the data set. Also, the columns are listed so poorly that I had to skip some in the read-in section. One thing I notice just by combing through the data set, Asian Americans have a much higher number of people making over $200,000 in household income, but this excludes Asian Pacific Islanders. Notes at the bottom also have to be removed. Before I can really see any trends in the data, I need to clean it and organize it. Right now we do not have tidy data!

Below, the column names do not make any sense and it is clearly not tidy. By changing some of the names, I hope to make it easier to manipulate later on.

Code
colnames(household)
 [1] "...1"                       "...2"                      
 [3] "Total"                      "Under $15,000"             
 [5] "$15,000\r\nto\r\n$24,999"   "$25,000\r\nto\r\n$34,999"  
 [7] "$35,000\r\nto\r\n$49,999"   "$50,000\r\nto\r\n$74,999"  
 [9] "$75,000\r\nto\r\n$99,999"   "$100,000\r\nto\r\n$149,999"
[11] "$150,000\r\nto\r\n$199,999" "$200,000 and over"         
[13] "Estimate...13"              "Margin of error1 (±)...14" 
[15] "Estimate...15"              "Margin of error1 (±)...16" 
Code
household <- household %>% 
  rename("year" = "...1", "num_thousands" = "...2", "median_income" = "Estimate...13", "mean_income" = "Estimate...15")

Okay, so I changed some of the names so it’s a little neater. Now, at least the columns, are easier to read, but there is still the problem of the “year” column. It contains all the races, as well as the years in the data set so it’s much more difficult to look at individual races. There are also random numbers at the end of some of the years.

Code
household <- household %>% 
  mutate(year = str_remove(year, " [0:28]"))

household
# A tibble: 383 × 16
   year    num_t…¹ Total Under…² $15,0…³ $25,0…⁴ $35,0…⁵ $50,0…⁶ $75,0…⁷ $100,…⁸
   <chr>   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ALL RA… <NA>       NA    NA      NA      NA      NA      NA      NA      NA  
 2 2019    128451    100     9.1     8       8.3    11.7    16.5    12.3    15.5
 3 2018    128579    100    10.1     8.8     8.7    12      17      12.5    15  
 4 2017    127669    100    10       9.1     9.2    12      16.4    12.4    14.7
 5 2017    127586    100    10.1     9.1     9.2    11.9    16.3    12.6    14.8
 6 2016    126224    100    10.4     9       9.2    12.3    16.7    12.2    15  
 7 2015    125819    100    10.6    10       9.6    12.1    16.1    12.4    14.9
 8 2014    124587    100    11.4    10.5     9.6    12.6    16.4    12.1    14  
 9 2013 3  123931    100    11.4    10.3     9.5    12.5    16.8    12      13.9
10 2013 4  122952    100    11.3    10.4     9.7    13.1    17      12.5    13.6
# … with 373 more rows, 6 more variables: `$150,000\r\nto\r\n$199,999` <dbl>,
#   `$200,000 and over` <dbl>, median_income <dbl>,
#   `Margin of error1 (±)...14` <dbl>, mean_income <chr>,
#   `Margin of error1 (±)...16` <chr>, and abbreviated variable names
#   ¹​num_thousands, ²​`Under $15,000`, ³​`$15,000\r\nto\r\n$24,999`,
#   ⁴​`$25,000\r\nto\r\n$34,999`, ⁵​`$35,000\r\nto\r\n$49,999`,
#   ⁶​`$50,000\r\nto\r\n$74,999`, ⁷​`$75,000\r\nto\r\n$99,999`, …
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

In the above function, I try to remove the excess numbers after the years in the “years” column, but I cannot figure out why the numbers still remain. It’s necessary I remove them in the future, but for now I will push on.

Sanity Check

Here are our dimensions for the data before it’s pivoted

Code
dim(household)
[1] 383  16

Challenge: Pivot the Chosen Data

Document your work here. What will a new “case” be once you have pivoted the data? How does it meet requirements for tidy data?

Code
household2 <- pivot_longer(household, cols = 4:12,
                          names_to = "income",
                          values_to = "count")

household2
# A tibble: 3,447 × 9
   year      num_thousands Total median_i…¹ Margi…² mean_…³ Margi…⁴ income count
   <chr>     <chr>         <dbl>      <dbl>   <dbl> <chr>   <chr>   <chr>  <dbl>
 1 ALL RACES <NA>             NA         NA      NA <NA>    <NA>    "Unde…  NA  
 2 ALL RACES <NA>             NA         NA      NA <NA>    <NA>    "$15,…  NA  
 3 ALL RACES <NA>             NA         NA      NA <NA>    <NA>    "$25,…  NA  
 4 ALL RACES <NA>             NA         NA      NA <NA>    <NA>    "$35,…  NA  
 5 ALL RACES <NA>             NA         NA      NA <NA>    <NA>    "$50,…  NA  
 6 ALL RACES <NA>             NA         NA      NA <NA>    <NA>    "$75,…  NA  
 7 ALL RACES <NA>             NA         NA      NA <NA>    <NA>    "$100…  NA  
 8 ALL RACES <NA>             NA         NA      NA <NA>    <NA>    "$150…  NA  
 9 ALL RACES <NA>             NA         NA      NA <NA>    <NA>    "$200…  NA  
10 2019      128451          100      68703     904 98088   1042    "Unde…   9.1
# … with 3,437 more rows, and abbreviated variable names ¹​median_income,
#   ²​`Margin of error1 (±)...14`, ³​mean_income, ⁴​`Margin of error1 (±)...16`
# ℹ Use `print(n = ...)` to see more rows

Well, that turned out horribly. I’m going to leave that mistake here and move on to try and fix that. Clearly, the data is anything but tidy.

Let’s try that again below.

Code
household3 <- pivot_longer(household, 4:12, names_to = "income_brackets", values_to = "count")

Switch to Eggs Data Set

I’m going to switch data sets because I think I am a little over my head in this data frame, from here on out I will be working with “Eggs”. Below I will run through the assignment quicker than above to save you some already spent time.

Code
eggs <- read_excel("_data/organiceggpoultry.xls",
                   sheet = "Data",
                   skip = 4,
                   range =cell_limits(c(6,2),c(NA,6)),
                  col_names = c("date", "xldozen", "xlhalf_dozen", "large_dozen", "large_half_dozen"))

eggs
# A tibble: 120 × 5
   date      xldozen xlhalf_dozen large_dozen large_half_dozen
   <chr>       <dbl>        <dbl>       <dbl>            <dbl>
 1 Jan 2004     230          132         230              126 
 2 February     230          134.        226.             128.
 3 March        230          137         225              131 
 4 April        234.         137         225              131 
 5 May          236          137         225              131 
 6 June         241          137         231.             134.
 7 July         241          137         234.             134.
 8 August       241          137         234.             134.
 9 September    241          136.        234.             130.
10 October      241          136.        234.             128.
# … with 110 more rows
# ℹ Use `print(n = ...)` to see more rows

Above I read in the data set as an excel file. I had to look at the solutions sheet to learn how to read in the data. The only trouble I had was figuring out the “range” parameter of the function. That one I just had to copy in because I was getting an error without it, but I understand that this tells R which cells to read.

Code
eggs %>% 
  select("date")  %>% 
  distinct()
# A tibble: 22 × 1
   date     
   <chr>    
 1 Jan 2004 
 2 February 
 3 March    
 4 April    
 5 May      
 6 June     
 7 July     
 8 August   
 9 September
10 October  
# … with 12 more rows
# ℹ Use `print(n = ...)` to see more rows

Now we still have the issue of notes in the names of our months. This is most evident with “Jan…”, but the ” /1” in February needs to go. This needs to be removed to make analysis later on a bit easier to look at.

Code
eggs <- eggs %>% 
  mutate(date = str_remove(date, " /1"))

Next, the January columns need to be dealt with so below I will remove the years in the “Jan” column with the separate and fill function.

Code
eggs <- eggs %>% 
  separate(date, c("month", "year"), convert = TRUE) %>% 
  fill("year")
  

eggs
# A tibble: 120 × 6
   month      year xldozen xlhalf_dozen large_dozen large_half_dozen
   <chr>     <int>   <dbl>        <dbl>       <dbl>            <dbl>
 1 Jan        2004    230          132         230              126 
 2 February   2004    230          134.        226.             128.
 3 March      2004    230          137         225              131 
 4 April      2004    234.         137         225              131 
 5 May        2004    236          137         225              131 
 6 June       2004    241          137         231.             134.
 7 July       2004    241          137         234.             134.
 8 August     2004    241          137         234.             134.
 9 September  2004    241          136.        234.             130.
10 October    2004    241          136.        234.             128.
# … with 110 more rows
# ℹ Use `print(n = ...)` to see more rows

There we go. Clean, easy to use data. You can see as the years progress, the price of eggs increase, though at what rate I am uncertain. Now the data set is read for some pivoting. Now let me do a sanity check to view the data dimensions before the pivot and then I will look at it after the pivot.

Code
dim(eggs)
[1] 120   6

There are four columns with the type of eggs this data set is viewing. If we want even tidier data we can collapse these four columns into one. Lets call this new data set “eggstidy” to represent the final form of the data in this challenge.

Code
eggstidy <- eggs %>% 
  pivot_longer(c(3:6), names_to = "egg_type", values_to = "price")

eggstidy
# A tibble: 480 × 4
   month     year egg_type         price
   <chr>    <int> <chr>            <dbl>
 1 Jan       2004 xldozen           230 
 2 Jan       2004 xlhalf_dozen      132 
 3 Jan       2004 large_dozen       230 
 4 Jan       2004 large_half_dozen  126 
 5 February  2004 xldozen           230 
 6 February  2004 xlhalf_dozen      134.
 7 February  2004 large_dozen       226.
 8 February  2004 large_half_dozen  128.
 9 March     2004 xldozen           230 
10 March     2004 xlhalf_dozen      137 
# … with 470 more rows
# ℹ Use `print(n = ...)` to see more rows

Finally! After banging my head against the wall with the household data set, reading in, cleaning, and pivoting this easier data set was much more manageable. It was helpful to look at the solution to how to read in this data set, but after that, it was easy to move on my own.

By pivoting the data, each row has one observation, making it tidy for future manipulation. Within each row we can look at the particular variables within that observation with ease.

Below the dimensions have clearly changed, adding significantly more rows, but also condensing the amount of columns.

Code
dim(eggstidy)
[1] 480   4