---
title: "Challenge 3"
author: "Quinn He"
description: "Tidy Data: Pivoting"
date: "08/19/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_3
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Note
Go to the "Eggs" label if you don't want to see me fumble through a harder data set.
## Read in data
```{r}
household <- read_excel("_data/USA Households by Total Money Income, Race, and Hispanic Origin of Householder 1967 to 2019.xlsx",
                        skip = 4)
household
```
### Briefly describe the data
This data set contains household income by race for householders from 1967 to 2019. It is a mess: the first column mixes race labels with years listed in descending order, one race group after another. The column headers are also laid out so poorly that I had to skip rows in the read-in section. One thing I notice just by combing through the data set is that Asian Americans have a much higher share of households making over \$200,000 in household income, though this excludes Asian Pacific Islanders. The footnotes at the bottom also have to be removed. Before I can really see any trends in the data, I need to clean it and organize it. Right now we do not have tidy data!
As the output below shows, the column names do not make much sense, and the data is clearly not tidy. By renaming some of the columns, I hope to make the data easier to manipulate later on.
```{r}
colnames(household)
```
```{r}
household <- household %>%
  rename("year" = "...1",
         "num_thousands" = "...2",
         "median_income" = "Estimate...13",
         "mean_income" = "Estimate...15")
```
Okay, so I changed some of the names so it's a little neater. Now at least the columns are easier to read, but there is still the problem of the "year" column: it contains the race labels as well as the years, which makes it much more difficult to look at individual races. There are also stray footnote numbers at the end of some of the years.
```{r}
household <- household %>%
mutate(year = str_remove(year, " [0:28]"))
household
```
In the chunk above, I try to remove the excess numbers after the years in the "year" column, but the numbers still remain. The catch is that `[0:28]` is a regex character class matching a single "0", ":", "2", or "8", not the numeric range 0 to 28. It's necessary I remove them in the future, but for now I will push on.
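For the record, here is a minimal sketch (using toy year strings, not the real data) of how that character class behaves versus a pattern that strips any trailing footnote digits:

```r
library(stringr)

years <- c("2019", "2017 2", "2013 3", "2013 4")

# " [0:28]" only matches a space followed by a literal "0", ":", "2", or "8",
# so it happens to clean "2017 2" but leaves "2013 3" and "2013 4" alone
str_remove(years, " [0:28]")

# " \\d+$" matches a space plus any run of digits at the end of the string,
# which strips every trailing footnote number
str_remove(years, " \\d+$")
```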
### Sanity Check
Here are the dimensions of the data before it's pivoted.
```{r}
dim(household)
```
### Challenge: Pivot the Chosen Data
Document your work here. What will a new "case" be once you have pivoted the data? How does it meet requirements for tidy data?
```{r}
household2 <- pivot_longer(household, cols = 4:12,
names_to = "income",
values_to = "count")
household2
```
Well, that turned out horribly. I'm going to leave that mistake here and move on to try to fix it. The race-label rows (like "ALL RACES") were carried through the pivot, so each one now produces a block of all-NA rows. Clearly, the data is anything but tidy.
Let's try that again below.
```{r}
household3 <- pivot_longer(household, 4:12, names_to = "income_brackets", values_to = "count")
```
### Switch to Eggs Data Set
I'm going to switch data sets because I think I'm in a little over my head with this data frame. From here on out I will be working with "Eggs". Below I will run through the assignment more quickly than above to save you some time.
```{r}
#| label: Eggs
eggs <- read_excel("_data/organiceggpoultry.xls",
sheet = "Data",
skip = 4,
range = cell_limits(c(6, 2), c(NA, 6)),
col_names = c("date", "xldozen", "xlhalf_dozen", "large_dozen", "large_half_dozen"))
eggs
```
Above I read in the data set from an Excel file. I had to look at the solutions sheet to learn how to read it in. The only trouble I had was figuring out the "range" parameter of the function. That one I had to copy in because I was getting an error without it, but I understand that it tells R which cells of the sheet to read.
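For anyone else puzzled by it, here is a small sketch of what that `range` argument describes. `cell_limits()` comes from the cellranger package (readxl re-exports it): the first vector is the upper-left cell as (row, column), the second is the lower-right, and `NA` leaves a bound open-ended. A fully bounded version (corners made up for illustration) can be printed in A1 notation:

```r
library(cellranger)

# c(6, 2) = row 6, column 2 (cell B6); an NA such as in c(NA, 6) would mean
# "down to the last occupied row". Here we use a bounded lower-right corner.
lim <- cell_limits(ul = c(6, 2), lr = c(10, 6))

# Convert to the familiar spreadsheet-style range string
as.range(lim, fo = "A1")
```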
```{r}
eggs %>%
select("date") %>%
distinct()
```
Now we still have the issue of notes in the names of our months. This is most evident with "Jan", which has the year attached, and the " /1" footnote marker on February needs to go. Removing these will make later analysis a bit easier to look at.
```{r}
eggs <- eggs %>%
mutate(date = str_remove(date, " /1"))
```
Next, the January entries need to be dealt with, so below I will split the year out of the "Jan" rows with `separate()` and carry it down the column with `fill()`.
```{r}
eggs <- eggs %>%
separate(date, c("month", "year"), convert = TRUE) %>%
fill("year")
eggs
```
There we go: clean, easy-to-use data. You can see that as the years progress, the price of eggs increases, though at what rate I am uncertain. Now the data set is ready for some pivoting. First, a sanity check to view the data dimensions before the pivot; I will look at them again after.
```{r}
dim(eggs)
```
There are four columns for the types of eggs this data set tracks. If we want even tidier data, we can collapse these four columns into one. Let's call this new data set "eggstidy" to represent the final form of the data in this challenge.
```{r}
eggstidy <- eggs %>%
pivot_longer(c(3:6), names_to = "egg_type", values_to = "price")
eggstidy
```
Finally! After banging my head against the wall with the household data set, reading in, cleaning, and pivoting this easier data set was much more manageable. It was helpful to look at the solution for reading in this data set, but after that it was easy to move on my own.
After pivoting the data, each row holds one observation, making it tidy for future manipulation. Within each row we can easily look at the particular variables for that observation.
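As a quick illustration of why one-observation-per-row is convenient, here is a toy tibble in the same shape as eggstidy (with made-up prices, not the real data); a long table like this can be grouped and summarised directly:

```r
library(dplyr)
library(tibble)

# Hypothetical data shaped like eggstidy, for illustration only
toy <- tribble(
  ~month, ~year, ~egg_type,     ~price,
  "Jan",   2004, "xldozen",     230,
  "Jan",   2004, "large_dozen", 230,
  "Jan",   2005, "xldozen",     240,
  "Jan",   2005, "large_dozen", 235
)

# With one row per observation, a yearly average per egg type is one verb away
toy %>%
  group_by(year, egg_type) %>%
  summarise(avg_price = mean(price), .groups = "drop")
```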
Below, the dimensions have clearly changed: significantly more rows, but fewer columns.
```{r}
dim(eggstidy)
```