---
title: "Challenge 3 Solutions"
author: "Vinitha Maheswaran"
description: "Tidy Data: Pivoting"
date: "11/24/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
- animal_weights
- eggs
- australian_marriage
- usa_households
- sce_labor
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2. identify what needs to be done to tidy the current data
3. anticipate the shape of pivoted data
4. pivot the data into tidy format using `pivot_longer`
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- animal_weights.csv ⭐
- eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
- australian_marriage\*.xls ⭐⭐⭐
- USA Households\*.xlsx ⭐⭐⭐⭐
- sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟
For this challenge I will be working with the "USA Households\*.xlsx" data set.
```{r}
library(readxl)
# Reading the USA Households\*.xlsx data set and storing in a data frame
column_names = c("Year", "Household_Number_Thousands", "Total_Percent_Distribution",
                 "Under $15,000", "$15,000 to $24,999", "$25,000 to $34,999",
                 "$35,000 to $49,999", "$50,000 to $74,999", "$75,000 to $99,999",
                 "$100,000 to $149,999", "$150,000 to $199,999", "$200,000 and over",
                 "Median_Income_Estimate", "Median_Income_MOE",
                 "Mean_Income_Estimate", "Mean_Income_MOE")
usa_data <- read_excel("_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx", col_names = column_names, skip = 5)
print(usa_data)
```
Since the "USA Households\*.xlsx" data set is in Excel format, I use the 'readxl' package to read it. After reading, the data is stored in a dataframe "usa_data". The first three rows of the file contain a description of the data, and the 4th and 5th rows contain the column headings. I resolve this by skipping the first 5 rows while reading the data set and supplying the renamed column names.
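One quick way to confirm that `skip = 5` is the right choice is to peek at the top of the sheet without assigning column names (a sketch, assuming the same file path as above):

```r
library(readxl)

# Read only the first 6 raw rows: rows 1-3 should be description text and
# rows 4-5 the two-level column headings, which is why skip = 5 is used.
peek <- read_excel(
  "_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx",
  n_max = 6, col_names = FALSE)
print(peek)
```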
## Data Cleaning
```{r}
# Removing the last 31 rows as they are just footnotes and not observations
usa_data <- head(usa_data,-31)
print(usa_data)
```
```{r}
# Dimensions of usa_data
dim(usa_data)
```
```{r}
# Creating new column for Race and filling the empty values for Race with the previous value in that column
usa_data <- usa_data %>%
  mutate(Race = case_when(str_detect(Year, "[A-Z]") ~ Year)) %>%
  fill(Race, .direction = "down")
# Removing the rows from usa_data which have non-numerical values in the Year column (these rows hold the Race value in the Year column and were inserted as separators between race groups)
usa_data <- usa_data %>%
  filter(!str_detect(Year, "[A-Z]"))
print(usa_data)
```
```{r}
# Removing the footnote number next to the year value in the "Year" column
usa_data <- usa_data%>%
mutate(Year = str_remove(Year, " .*"))
usa_data
```
```{r}
# Removing the footnote number next to the race value in the "Race" column
usa_data <- usa_data%>%
mutate(Race = str_remove(Race, "[0-9]+"))
usa_data
```
```{r}
# Remove the "Total_Percent_Distribution" column as that value is 100 for all observations
usa_data <- usa_data%>%
subset(select = -c(3))
usa_data
```
```{r}
# Reorder the columns in the usa_data dataframe so that "Race" is the first column followed by "Year"
usa_data <- usa_data %>%
  relocate(Race)
usa_data
```
```{r}
# Find number of distinct Year values
n_distinct(usa_data$Year)
# Sanity check: Check that the unique Year values range from 1967 - 2019 after data cleaning
unique(usa_data$Year)
```
```{r}
# Find number of distinct Race values
n_distinct(usa_data$Race)
# Sanity check: Check that the unique Race values do not have any footnote numbers after data cleaning
unique(usa_data$Race)
```
```{r}
#Summary of usa_data
library(summarytools)
print(summarytools::dfSummary(usa_data,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.60,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
```
After reading the data and renaming the columns, we have 383 rows and 16 columns in the dataframe. The last 31 rows are footnotes rather than observations, so removing them leaves 352 rows and 16 columns.

The first column "Year" mixes year values with Race and Hispanic origin headings: a heading row is used as a separator for each race group. I therefore created a new column "Race", filled it with the heading value where present, and carried the previous value down otherwise, giving 12 distinct race groups. I then removed the rows with non-numerical values in the first column, since those separator rows are not observations. This results in a dataframe of 340 rows and 17 columns with separate "Year" and "Race" columns.

Next, I removed the footnote numbers appended to values in the "Year" and "Race" columns, and dropped the "Total_Percent_Distribution" column because it is 100 for every observation and carries no information: the percentage values for the 9 income levels in each observation should add up to 100. Finally, I reordered the columns so that "Race" is the first column, followed by "Year" and the remaining columns. After cleaning, I end up with a dataframe of 340 observations and 16 columns. As a sanity check, I confirmed that the data covers the period 1967 - 2019 and contains 12 race categories in total. I have not removed the observations with race value "ALL RACES", as I am not sure whether they aggregate the other race categories. I summarized the data using the dfSummary() function and confirmed that there are no duplicates or missing values in the data.
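The separator-row cleanup above can be illustrated on a tiny hypothetical table (not the real Census data): race headings sit in the `Year` column, get tagged and carried down with `fill()`, and the separator rows are then dropped, with footnote digits stripped afterwards.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy version of the raw sheet: heading rows act as separators between groups
toy <- tibble(
  Year  = c("ALL RACES", "2019", "2018", "WHITE ALONE 1", "2019 2", "2018"),
  Value = c(NA, 9.1, 10.1, NA, 8.0, 8.5)
)

toy_clean <- toy %>%
  mutate(Race = case_when(str_detect(Year, "[A-Z]") ~ Year)) %>%  # tag heading rows
  fill(Race, .direction = "down") %>%                             # carry heading down
  filter(!str_detect(Year, "[A-Z]")) %>%                          # drop separator rows
  mutate(Year = str_remove(Year, " .*"),                          # strip year footnotes
         Race = str_trim(str_remove(Race, "[0-9]+")))             # strip race footnotes

toy_clean  # 4 rows: 2019/2018 for ALL RACES, 2019/2018 for WHITE ALONE
```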
### Briefly describe the data
Describe the data, and be sure to comment on why you are planning to pivot it to make it "tidy"
The USA household data contains information on the mean and median income grouped by Race and Hispanic Origin of householder for the period 1967 - 2019. The data is split into 12 race-based categories, and for each we have the total number of households surveyed in a given year, the total percentage distribution (100 for all observations), the percentage distribution across the income levels, and the mean and median income (estimate and margin of error). For some races, data is not available for every year in the period 1967 - 2019.

Currently the data combines multiple observations in a single row, which makes data manipulation operations like filtering and grouping difficult. Hence, we need to pivot the data to make it possible to group the data and calculate aggregates/statistics for the 12 race categories across the years 1967 - 2019. Pivoting turns each income-level percentage for a given race and year into an individual observation, which also makes it easier to plot graphs and visualize trends.
## Anticipate the End Result
The first step in pivoting the data is to try to come up with a concrete vision of what the end product *should* look like - that way you will know whether or not your pivoting was successful.
One easy way to do this is to think about the dimensions of your current data (tibble, dataframe, or matrix), and then calculate what the dimensions of the pivoted data should be.
Suppose you have a dataset with $n$ rows and $k$ variables. In our example, 3 of the variables are used to identify a case, so you will be pivoting $k-3$ variables into a longer format where the $k-3$ variable names will move into the `names_to` variable and the current values in each of those columns will move into the `values_to` variable. Therefore, we would expect $n * (k-3)$ rows in the pivoted dataframe!
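The $n * (k-3)$ arithmetic can be checked on a tiny hypothetical table: with $n = 2$ rows, $k = 5$ variables, and 3 id columns, pivoting the remaining $k - 3 = 2$ measure columns should give $2 * 2 = 4$ rows and $3 + 2 = 5$ columns.

```r
library(tidyr)

# 2 rows, 3 id columns (id1-id3), 2 measure columns (m1, m2)
wide <- tibble::tibble(id1 = c("a", "b"), id2 = 1:2, id3 = c("x", "y"),
                       m1 = c(10, 20), m2 = c(30, 40))

long <- pivot_longer(wide, cols = c(m1, m2),
                     names_to = "measure", values_to = "value")

dim(long)  # n * (k - 3) = 4 rows, 3 + 2 = 5 columns
```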
### Challenge: Describe the final dimensions
Document your work here.
```{r}
# Printing the dataframe
usa_data
```
```{r}
# Removing the columns containing mean and median income
usa_data_final <- usa_data %>%
  select(-(13:16))
usa_data_final
```
```{r}
#existing rows/cases
nrow(usa_data_final)
```
```{r}
# existing columns
ncol(usa_data_final)
```
```{r}
#expected rows/cases
nrow(usa_data_final) * (ncol(usa_data_final)-3)
```
```{r}
# expected columns
3 + 2
```
Any additional comments?
I plan to pivot the data using the pivot_longer() function, which increases the number of rows and decreases the number of columns; the resulting table will have many more rows and far fewer columns. This makes it easier to compare income levels for a given race and year.

Currently, the dataframe contains 340 rows and 16 columns. If we pivoted everything, the mean and median income values would be repeated across each income-level row, which would make comparisons cumbersome. Hence, I subset the data and remove the columns containing mean and median income; those attributes could be pivoted into a separate table in later steps. After subsetting, we are left with 340 rows and 12 columns. The pivot is done on the 9 income-level columns, whose names move into a single column "Income Level" and whose percentage values move into a column "Percent Value", keeping the "Race", "Year", and "Household_Number_Thousands" columns. Therefore, after pivoting we expect 9 * 340 = 3060 rows and 3 + 2 = 5 columns: "Race", "Year", "Household_Number_Thousands", "Income Level", and "Percent Value".
## Pivot the Data
Now we will pivot the data, and compare our pivoted data dimensions to the dimensions calculated above as a "sanity" check.
### Challenge: Pivot the Chosen Data
Document your work here. What will a new "case" be once you have pivoted the data? How does it meet requirements for tidy data?
```{r}
# Pivoting the data using pivot_longer()
df <- pivot_longer(usa_data_final, cols = 4:12,
                   names_to = "Income Level",
                   values_to = "Percent Value")
df
```
```{r}
# Checking dimension of the pivot table
dim(df)
```
Any additional comments?
The pivoted table has 3060 rows/observations and 5 columns, which matches the dimensions we predicted in the previous section. In the pivot table, we can observe the household count and the percentage value for each of the 9 income levels for every Race and Year.
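To illustrate what the long format buys us (using toy data with the same five columns as the pivoted table, not the real Census values), the income-level percentages can now be grouped per Race and Year, for example to check that they sum to roughly 100 within each group:

```r
library(dplyr)

# Hypothetical slice of the pivoted table for one Race/Year group
toy_long <- tibble(
  Race = "ALL RACES",
  Year = "2019",
  Household_Number_Thousands = "128451",
  `Income Level`  = c("Under $15,000", "$15,000 to $24,999", "$200,000 and over"),
  `Percent Value` = c(40, 35, 25)
)

# Each (Race, Year) group is now a set of tidy observations we can aggregate
totals <- toy_long %>%
  group_by(Race, Year) %>%
  summarise(total_percent = sum(`Percent Value`), .groups = "drop")

totals  # total_percent should be (close to) 100 per group
```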