challenge_4
Author

Lai Wei

Published

August 18, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.csv ⭐⭐
  • FedFundsRate.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
  • debt_in_trillions ⭐⭐⭐⭐⭐
Code
abc_poll<-read_csv("_data/abc_poll_2021.csv")
# demographic information about the poll respondents (columns starting with "pp")
abc_poll %>%
  select(starts_with("p"))
# A tibble: 527 × 14
   ppage ppeduc5    ppedu…¹ ppgen…² ppethm pphhs…³ ppinc7 ppmar…⁴ ppmsa…⁵ ppreg4
   <dbl> <chr>      <chr>   <chr>   <chr>  <chr>   <chr>  <chr>   <chr>   <chr> 
 1    68 "High sch… High s… Female  White… 2       $25,0… Now Ma… Metro … South 
 2    85 "Bachelor… Bachel… Male    White… 2       $150,… Now Ma… Metro … South 
 3    69 "High sch… High s… Male    White… 2       $100,… Now Ma… Metro … South 
 4    74 "Bachelor… Bachel… Female  White… 1       $25,0… Divorc… Metro … North…
 5    77 "High sch… High s… Male    White… 3       $10,0… Now Ma… Metro … MidWe…
 6    70 "Bachelor… Bachel… Male    White… 2       $75,0… Now Ma… Metro … MidWe…
 7    26 "Master\x… Bachel… Male    Other… 3       $150,… Never … Metro … North…
 8    76 "Bachelor… Bachel… Male    Black… 2       $50,0… Now Ma… Metro … South 
 9    78 "Bachelor… Bachel… Female  White… 2       $150,… Now Ma… Metro … West  
10    47 "Master\x… Bachel… Male    Other… 4       $150,… Now Ma… Non-me… North…
# … with 517 more rows, 4 more variables: pprent <chr>, ppstaten <chr>,
#   PPWORKA <chr>, ppemploy <chr>, and abbreviated variable names ¹​ppeducat,
#   ²​ppgender, ³​pphhsize, ⁴​ppmarit5, ⁵​ppmsacat
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Code
colnames(abc_poll)
 [1] "id"              "xspanish"        "complete_status" "ppage"          
 [5] "ppeduc5"         "ppeducat"        "ppgender"        "ppethm"         
 [9] "pphhsize"        "ppinc7"          "ppmarit5"        "ppmsacat"       
[13] "ppreg4"          "pprent"          "ppstaten"        "PPWORKA"        
[17] "ppemploy"        "Q1_a"            "Q1_b"            "Q1_c"           
[21] "Q1_d"            "Q1_e"            "Q1_f"            "Q2"             
[25] "Q3"              "Q4"              "Q5"              "QPID"           
[29] "ABCAGE"          "Contact"         "weights_pid"    

Briefly describe the data

The file abc_poll_2021.csv has 527 rows and 31 columns, and each row appears to be one respondent, identified by id. Fourteen columns whose names start with pp (including PPWORKA) hold respondent demographics such as age, education, gender, ethnicity, household size, income, marital status, metro status, region, housing, state, and employment. Eleven columns starting with Q (Q1_a to Q1_f, Q2 to Q5, and QPID) hold the survey questions. The remaining columns are id, xspanish, complete_status, ABCAGE, Contact, and weights_pid.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
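
One way to anticipate the end result is to compute the expected shape before reshaping and compare it afterwards. The sketch below assumes the Q1_a to Q1_f items would be pivoted into long form; it runs on a small made-up stand-in for abc_poll, and the names toy, q1_long, item, and response, as well as the answer values, are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for abc_poll: 3 respondents, 2 of the Q1 items (made-up values)
toy <- tibble(
  id   = c(1, 2, 3),
  Q1_a = c("Approve", "Disapprove", "Skipped"),
  Q1_b = c("Approve", "Approve", "Disapprove")
)

# Expected size after pivoting: one row per respondent per Q1 item
n_items <- toy %>% select(starts_with("Q1_")) %>% ncol()
expected_rows <- nrow(toy) * n_items

q1_long <- toy %>%
  pivot_longer(starts_with("Q1_"), names_to = "item", values_to = "response")

# Sanity check: stops with an error if the shape is not what was anticipated
stopifnot(nrow(q1_long) == expected_rows)
```

On the real data the same check would expect 527 rows times the number of Q1 columns.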

Code
# column names of the survey questions
# (starts_with() ignores case by default, so "q" matches Q1_a ... QPID)
abc_poll %>%
  select(starts_with("q")) %>%
  colnames()
 [1] "Q1_a" "Q1_b" "Q1_c" "Q1_d" "Q1_e" "Q1_f" "Q2"   "Q3"   "Q4"   "Q5"  
[11] "QPID"
Code
# number of distinct ethnicity categories
n_distinct(abc_poll$ppethm)
[1] 5
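
n_distinct() gives only the number of categories. A possible follow-up, sketched here on a made-up stand-in for abc_poll (the rows and the names toy and eth_counts are hypothetical; the labels are ones visible in the output above), is to list each category with its frequency using count().

```r
library(dplyr)

# Toy stand-in for abc_poll with a few ethnicity labels (made-up rows)
toy <- tibble(
  ppethm = c("White, Non-Hispanic", "White, Non-Hispanic",
             "Other, Non-Hispanic", "Black, Non-Hispanic")
)

# One row per category, most frequent first
eth_counts <- toy %>% count(ppethm, sort = TRUE)
eth_counts

# The number of rows in the count table should equal n_distinct()
stopifnot(nrow(eth_counts) == n_distinct(toy$ppethm))
```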
Code
# Table with five selected variables
Table1 <- abc_poll %>%
  select(ppage, ppeducat, ppethm, pprent, ppemploy)
Table1
# A tibble: 527 × 5
   ppage ppeducat                   ppethm              pprent           ppemp…¹
   <dbl> <chr>                      <chr>               <chr>            <chr>  
 1    68 High school                White, Non-Hispanic Owned or being … Not wo…
 2    85 Bachelors degree or higher White, Non-Hispanic Owned or being … Not wo…
 3    69 High school                White, Non-Hispanic Owned or being … Not wo…
 4    74 Bachelors degree or higher White, Non-Hispanic Owned or being … Not wo…
 5    77 High school                White, Non-Hispanic Owned or being … Not wo…
 6    70 Bachelors degree or higher White, Non-Hispanic Owned or being … Workin…
 7    26 Bachelors degree or higher Other, Non-Hispanic Owned or being … Workin…
 8    76 Bachelors degree or higher Black, Non-Hispanic Owned or being … Not wo…
 9    78 Bachelors degree or higher White, Non-Hispanic Owned or being … Not wo…
10    47 Bachelors degree or higher Other, Non-Hispanic Owned or being … Workin…
# … with 517 more rows, and abbreviated variable name ¹​ppemploy
# ℹ Use `print(n = ...)` to see more rows

Any additional comments?

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.
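
As an illustration of turning variables into factors, the sketch below bins ppage into an ordered factor and converts ppgender to a factor. It runs on a made-up stand-in for abc_poll; the age bands, their labels, and the names toy, toy_fct, and age_group are assumptions chosen for the example, not categories from the poll.

```r
library(dplyr)

# Toy stand-in for abc_poll (made-up rows; column names match the real data)
toy <- tibble(
  ppage    = c(68, 85, 26, 47),
  ppgender = c("Female", "Male", "Male", "Male")
)

toy_fct <- toy %>%
  mutate(
    # Hypothetical age bands; right = FALSE gives [18,30), [30,45), [45,60), [60,Inf)
    age_group = cut(ppage,
                    breaks = c(18, 30, 45, 60, Inf),
                    labels = c("18-29", "30-44", "45-59", "60+"),
                    right = FALSE,
                    ordered_result = TRUE),
    ppgender = factor(ppgender)
  )

# Each band should contain only the ages it claims to
toy_fct %>%
  group_by(age_group) %>%
  summarise(min_age = min(ppage), max_age = max(ppage), n = n())
```

With these breaks, any age below 18 would become NA, which is itself a useful signal when checking the result.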

Code
# Flag respondents older than 60 and female respondents
Table2 <- abc_poll %>%
  mutate(
    elder = ppage > 60,            # TRUE if age is over 60
    gender = ppgender == "Female"  # TRUE if respondent is female
  )
Table2
# A tibble: 527 × 33
        id xspanish comple…¹ ppage ppeduc5 ppedu…² ppgen…³ ppethm pphhs…⁴ ppinc7
     <dbl> <chr>    <chr>    <dbl> <chr>   <chr>   <chr>   <chr>  <chr>   <chr> 
 1 7230001 English  qualifi…    68 "High … High s… Female  White… 2       $25,0…
 2 7230002 English  qualifi…    85 "Bache… Bachel… Male    White… 2       $150,…
 3 7230003 English  qualifi…    69 "High … High s… Male    White… 2       $100,…
 4 7230004 English  qualifi…    74 "Bache… Bachel… Female  White… 1       $25,0…
 5 7230005 English  qualifi…    77 "High … High s… Male    White… 3       $10,0…
 6 7230006 English  qualifi…    70 "Bache… Bachel… Male    White… 2       $75,0…
 7 7230007 English  qualifi…    26 "Maste… Bachel… Male    Other… 3       $150,…
 8 7230008 English  qualifi…    76 "Bache… Bachel… Male    Black… 2       $50,0…
 9 7230009 English  qualifi…    78 "Bache… Bachel… Female  White… 2       $150,…
10 7230010 English  qualifi…    47 "Maste… Bachel… Male    Other… 4       $150,…
# … with 517 more rows, 23 more variables: ppmarit5 <chr>, ppmsacat <chr>,
#   ppreg4 <chr>, pprent <chr>, ppstaten <chr>, PPWORKA <chr>, ppemploy <chr>,
#   Q1_a <chr>, Q1_b <chr>, Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>,
#   Q2 <chr>, Q3 <chr>, Q4 <chr>, Q5 <chr>, QPID <chr>, ABCAGE <chr>,
#   Contact <chr>, weights_pid <dbl>, elder <lgl>, gender <lgl>, and
#   abbreviated variable names ¹​complete_status, ²​ppeducat, ³​ppgender,
#   ⁴​pphhsize
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
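
The elder and gender flags can be sanity checked by comparing them with the columns they were derived from. This is a minimal sketch on a made-up stand-in for abc_poll (the rows and the names toy and toy2 are hypothetical); the same calls should work on Table2.

```r
library(dplyr)

# Toy stand-in for abc_poll, including the boundary age of exactly 60
toy <- tibble(
  ppage    = c(68, 60, 26, 47),
  ppgender = c("Female", "Male", "Male", "Female")
)

toy2 <- toy %>%
  mutate(elder = ppage > 60, gender = ppgender == "Female")

# Cross-tabulate each new flag against its source column
toy2 %>% count(ppgender, gender)
toy2 %>%
  group_by(elder) %>%
  summarise(min_age = min(ppage), max_age = max(ppage))

# Programmatic checks: no missing flags, and the age cut-off is respected
stopifnot(
  !anyNA(toy2$elder), !anyNA(toy2$gender),
  all(toy2$ppage[toy2$elder] > 60),
  all(toy2$ppage[!toy2$elder] <= 60)
)
```

Because the comparison is strict, a respondent aged exactly 60 is not flagged as elder.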

Any additional comments?