library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Lai Wei
August 18, 2022
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
# A tibble: 527 × 14
ppage ppeduc5 ppedu…¹ ppgen…² ppethm pphhs…³ ppinc7 ppmar…⁴ ppmsa…⁵ ppreg4
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 68 "High sch… High s… Female White… 2 $25,0… Now Ma… Metro … South
2 85 "Bachelor… Bachel… Male White… 2 $150,… Now Ma… Metro … South
3 69 "High sch… High s… Male White… 2 $100,… Now Ma… Metro … South
4 74 "Bachelor… Bachel… Female White… 1 $25,0… Divorc… Metro … North…
5 77 "High sch… High s… Male White… 3 $10,0… Now Ma… Metro … MidWe…
6 70 "Bachelor… Bachel… Male White… 2 $75,0… Now Ma… Metro … MidWe…
7 26 "Master\x… Bachel… Male Other… 3 $150,… Never … Metro … North…
8 76 "Bachelor… Bachel… Male Black… 2 $50,0… Now Ma… Metro … South
9 78 "Bachelor… Bachel… Female White… 2 $150,… Now Ma… Metro … West
10 47 "Master\x… Bachel… Male Other… 4 $150,… Now Ma… Non-me… North…
# … with 517 more rows, 4 more variables: pprent <chr>, ppstaten <chr>,
# PPWORKA <chr>, ppemploy <chr>, and abbreviated variable names ¹ppeducat,
# ²ppgender, ³pphhsize, ⁴ppmarit5, ⁵ppmsacat
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
[1] "id" "xspanish" "complete_status" "ppage"
[5] "ppeduc5" "ppeducat" "ppgender" "ppethm"
[9] "pphhsize" "ppinc7" "ppmarit5" "ppmsacat"
[13] "ppreg4" "pprent" "ppstaten" "PPWORKA"
[17] "ppemploy" "Q1_a" "Q1_b" "Q1_c"
[21] "Q1_d" "Q1_e" "Q1_f" "Q2"
[25] "Q3" "Q4" "Q5" "QPID"
[29] "ABCAGE" "Contact" "weights_pid"
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
[1] "Q1_a" "Q1_b" "Q1_c" "Q1_d" "Q1_e" "Q1_f" "Q2" "Q3" "Q4" "Q5"
[11] "QPID"
[1] 5
# A tibble: 527 × 5
ppage ppeducat ppethm pprent ppemp…¹
<dbl> <chr> <chr> <chr> <chr>
1 68 High school White, Non-Hispanic Owned or being … Not wo…
2 85 Bachelors degree or higher White, Non-Hispanic Owned or being … Not wo…
3 69 High school White, Non-Hispanic Owned or being … Not wo…
4 74 Bachelors degree or higher White, Non-Hispanic Owned or being … Not wo…
5 77 High school White, Non-Hispanic Owned or being … Not wo…
6 70 Bachelors degree or higher White, Non-Hispanic Owned or being … Workin…
7 26 Bachelors degree or higher Other, Non-Hispanic Owned or being … Workin…
8 76 Bachelors degree or higher Black, Non-Hispanic Owned or being … Not wo…
9 78 Bachelors degree or higher White, Non-Hispanic Owned or being … Not wo…
10 47 Bachelors degree or higher Other, Non-Hispanic Owned or being … Workin…
# … with 517 more rows, and abbreviated variable name ¹ppemploy
# ℹ Use `print(n = ...)` to see more rows
Any additional comments?
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
# A tibble: 527 × 33
id xspanish comple…¹ ppage ppeduc5 ppedu…² ppgen…³ ppethm pphhs…⁴ ppinc7
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 7230001 English qualifi… 68 "High … High s… Female White… 2 $25,0…
2 7230002 English qualifi… 85 "Bache… Bachel… Male White… 2 $150,…
3 7230003 English qualifi… 69 "High … High s… Male White… 2 $100,…
4 7230004 English qualifi… 74 "Bache… Bachel… Female White… 1 $25,0…
5 7230005 English qualifi… 77 "High … High s… Male White… 3 $10,0…
6 7230006 English qualifi… 70 "Bache… Bachel… Male White… 2 $75,0…
7 7230007 English qualifi… 26 "Maste… Bachel… Male Other… 3 $150,…
8 7230008 English qualifi… 76 "Bache… Bachel… Male Black… 2 $50,0…
9 7230009 English qualifi… 78 "Bache… Bachel… Female White… 2 $150,…
10 7230010 English qualifi… 47 "Maste… Bachel… Male Other… 4 $150,…
# … with 517 more rows, 23 more variables: ppmarit5 <chr>, ppmsacat <chr>,
# ppreg4 <chr>, pprent <chr>, ppstaten <chr>, PPWORKA <chr>, ppemploy <chr>,
# Q1_a <chr>, Q1_b <chr>, Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>,
# Q2 <chr>, Q3 <chr>, Q4 <chr>, Q5 <chr>, QPID <chr>, ABCAGE <chr>,
# Contact <chr>, weights_pid <dbl>, elder <lgl>, gender <lgl>, and
# abbreviated variable names ¹complete_status, ²ppeducat, ³ppgender,
# ⁴pphhsize
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Any additional comments?
---
title: "Challenge 4"
author: "Lai Wei"
description: "More data wrangling: pivoting"
date: "08/18/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_4
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- abc_poll.csv ⭐
- poultry_tidy.csv ⭐⭐
- FedFundsRate.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
- debt_in_trillions ⭐⭐⭐⭐⭐
```{r}
# Read the ABC poll data
abc_poll <- read_csv("_data/abc_poll_2021.csv")

# Respondent demographic variables (column names starting with "p")
abc_poll %>%
  select(starts_with("p"))

# All column names
colnames(abc_poll)
```
### Briefly describe the data
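The printed output shows 527 rows and 31 columns: 14 columns whose names start with "p" (respondent demographics such as `ppage`, `ppeducat`, and `ppethm`), 11 whose names start with "Q" (`Q1_a` through `QPID`), plus `id`, `xspanish`, `complete_status`, `ABCAGE`, `Contact`, and `weights_pid`. A minimal sketch of how such a description could be backed up in code follows. `toy_poll` is a hypothetical three-row stand-in with made-up values, used only so the chunk runs on its own; the same calls should apply to `abc_poll`.

```{r}
library(tidyverse)

# Hypothetical stand-in for abc_poll: column names are taken from the
# colnames() output, the values are invented for illustration.
toy_poll <- tibble(
  id       = c(1, 2, 3),
  ppage    = c(68, 85, 26),
  ppgender = c("Female", "Male", "Male"),
  Q2       = c("response A", "response B", "response A")
)

dim(toy_poll)      # number of rows (respondents) and columns (variables)
glimpse(toy_poll)  # one line per column: name, type, first values

# Group the column names by prefix to see the blocks of variables
demo_cols     <- toy_poll %>% select(starts_with("pp")) %>% colnames()
question_cols <- toy_poll %>% select(starts_with("Q")) %>% colnames()
demo_cols
question_cols
```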
## Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
```{r}
# Names of the survey question columns (starts_with() ignores case by default)
abc_poll %>%
  select(starts_with("q")) %>%
  colnames()

# Number of distinct ethnicity categories
n_distinct(abc_poll$ppethm)

# Subset of five demographic variables
Table1 <- abc_poll %>%
  select(ppage, ppeducat, ppethm, pprent, ppemploy)
Table1
```
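With one row per respondent, the data can already be treated as tidy for many purposes. If the six `Q1_*` items were to be compared against each other, one option would be to pivot them into a long format and check the result against the expected size. The sketch below assumes the `Q1_*` columns form a single battery of items, which the source does not confirm. `toy_poll`, `q1_item`, and `q1_response` are hypothetical names, and the response values are placeholders.

```{r}
library(tidyverse)

# Hypothetical stand-in for abc_poll with placeholder responses
toy_poll <- tibble(
  id   = c(1, 2, 3),
  Q1_a = c("resp1", "resp2", "resp1"),
  Q1_b = c("resp2", "resp2", "resp1"),
  Q1_c = c("resp1", "resp1", "resp1"),
  Q1_d = c("resp2", "resp1", "resp2"),
  Q1_e = c("resp1", "resp2", "resp2"),
  Q1_f = c("resp2", "resp2", "resp2")
)

# Anticipate the end result before pivoting: rows x pivoted columns
n_q1          <- toy_poll %>% select(starts_with("Q1_")) %>% ncol()
expected_rows <- nrow(toy_poll) * n_q1

toy_long <- toy_poll %>%
  pivot_longer(
    cols      = starts_with("Q1_"),
    names_to  = "q1_item",      # which Q1 item the row refers to
    values_to = "q1_response"   # the answer given to that item
  )

# Sanity check: the long table has exactly the anticipated number of rows
stopifnot(nrow(toy_long) == expected_rows)
toy_long
```

Applied to the full data, the same arithmetic gives 527 × 6 = 3162 rows and 31 − 6 + 2 = 27 columns as the anticipated result.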
Any additional comments?
## Identify variables that need to be mutated
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
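On the question of factors, one candidate is `ppeducat`, which is stored as character but has a natural order. The sketch below shows one way to convert it to an ordered factor and check that no value was lost. `toy_poll` and `educ_levels` are hypothetical: only the two categories visible in the printed output are used, and the real column may contain others, which `unique(abc_poll$ppeducat)` would list.

```{r}
library(tidyverse)

# Hypothetical stand-in using only the categories visible in the output
toy_poll <- tibble(
  ppeducat = c("High school", "Bachelors degree or higher", "High school")
)

# Assumed level order, lowest to highest; extend after inspecting the data
educ_levels <- c("High school", "Bachelors degree or higher")

toy_factor <- toy_poll %>%
  mutate(ppeducat_f = factor(ppeducat, levels = educ_levels, ordered = TRUE))

# Values missing from educ_levels would become NA, so count them
sum(is.na(toy_factor$ppeducat_f))
levels(toy_factor$ppeducat_f)
```

An ordered factor keeps the categories in this order in tables and plots instead of sorting them alphabetically.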
```{r}
# Flag respondents older than 60 and respondents recorded as female
Table2 <- abc_poll %>%
  mutate(
    elder  = ppage > 60,
    gender = ppgender == "Female"
  )
Table2
```
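The challenge also asks for a sanity check of every mutation. A minimal sketch of two such checks follows, run on a hypothetical four-row `toy_poll` with made-up values. Cross-tabulating the original and derived columns shows whether each source value maps to a single flag value, and the age range within each `elder` group confirms where the cutoff falls.

```{r}
library(tidyverse)

# Hypothetical stand-in for abc_poll; includes the boundary age 60
toy_poll <- tibble(
  ppage    = c(68, 60, 26, 85),
  ppgender = c("Female", "Male", "Male", "Female")
)

toy_checked <- toy_poll %>%
  mutate(
    elder  = ppage > 60,
    gender = ppgender == "Female"
  )

# Each ppgender value should appear with exactly one value of gender
toy_checked %>% count(ppgender, gender)

# Ages in the FALSE group should be at most 60, ages in the TRUE group above 60
toy_checked %>%
  group_by(elder) %>%
  summarise(min_age = min(ppage), max_age = max(ppage), n = n())
```

Because the comparison is `ppage > 60`, a respondent aged exactly 60 is not flagged as `elder`.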
Any additional comments?