challenge_4
abc_poll
eggs
fed_rates
hotel_bookings
debt
Author

Thrishul

Published

August 18, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
Code
abc_poll_orig<-read_csv("_data/abc_poll_2021.csv")

# political questions
abc_poll_orig%>%
  select(starts_with("Q"))%>%
  colnames(.)
 [1] "Q1_a" "Q1_b" "Q1_c" "Q1_d" "Q1_e" "Q1_f" "Q2"   "Q3"   "Q4"   "Q5"  
[11] "QPID"
Code
# all but one demographer
abc_poll_orig%>%
  select(starts_with("pp"))%>%
  colnames(.)
 [1] "ppage"    "ppeduc5"  "ppeducat" "ppgender" "ppethm"   "pphhsize"
 [7] "ppinc7"   "ppmarit5" "ppmsacat" "ppreg4"   "pprent"   "ppstaten"
[13] "PPWORKA"  "ppemploy"
Code
# national poll
n_distinct(abc_poll_orig$ppstaten)
[1] 49

Briefly describe the data

The ABC Poll dataset is likely a national sample survey, conducted in 2019, that includes responses from 527 individuals. The survey covers a range of topics, including 10 questions related to political attitudes and beliefs, as well as party identification. Additionally, there are 15 demographic variables included in the dataset, some of which have been recoded to facilitate analysis. Finally, the dataset also includes 5 survey administration variables that provide information on the methodology and logistics of the survey administration. Overall, the dataset is a comprehensive collection of information on political attitudes and demographics in the surveyed population, and it offers a valuable resource for researchers and analysts interested in understanding these topics

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

Code
print(summarytools::dfSummary(abc_poll_orig,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

abc_poll_orig

Dimensions: 527 x 31
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
id [numeric]
Mean (sd) : 7230264 (152.3)
min ≤ med ≤ max:
7230001 ≤ 7230264 ≤ 7230527
IQR (CV) : 263 (0)
527 distinct values 0 (0.0%)
xspanish [character]
1. English
2. Spanish
514 ( 97.5% )
13 ( 2.5% )
0 (0.0%)
complete_status [character] 1. qualified
527 ( 100.0% )
0 (0.0%)
ppage [numeric]
Mean (sd) : 53.4 (17.1)
min ≤ med ≤ max:
18 ≤ 55 ≤ 91
IQR (CV) : 27 (0.3)
72 distinct values 0 (0.0%)
ppeduc5 [character]
1. NA
2. NA
3. High school graduate (hig
4. No high school diploma or
5. Some college or Associate
99 ( 18.8% )
108 ( 20.5% )
133 ( 25.2% )
29 ( 5.5% )
158 ( 30.0% )
0 (0.0%)
ppeducat [character]
1. Bachelors degree or highe
2. High school
3. Less than high school
4. Some college
207 ( 39.3% )
133 ( 25.2% )
29 ( 5.5% )
158 ( 30.0% )
0 (0.0%)
ppgender [character]
1. Female
2. Male
254 ( 48.2% )
273 ( 51.8% )
0 (0.0%)
ppethm [character]
1. 2+ Races, Non-Hispanic
2. Black, Non-Hispanic
3. Hispanic
4. Other, Non-Hispanic
5. White, Non-Hispanic
21 ( 4.0% )
27 ( 5.1% )
51 ( 9.7% )
24 ( 4.6% )
404 ( 76.7% )
0 (0.0%)
pphhsize [character]
1. 1
2. 2
3. 3
4. 4
5. 5
6. 6 or more
80 ( 15.2% )
219 ( 41.6% )
102 ( 19.4% )
76 ( 14.4% )
35 ( 6.6% )
15 ( 2.8% )
0 (0.0%)
ppinc7 [character]
1. $10,000 to $24,999
2. $100,000 to $149,999
3. $150,000 or more
4. $25,000 to $49,999
5. $50,000 to $74,999
6. $75,000 to $99,999
7. Less than $10,000
32 ( 6.1% )
105 ( 19.9% )
137 ( 26.0% )
82 ( 15.6% )
85 ( 16.1% )
69 ( 13.1% )
17 ( 3.2% )
0 (0.0%)
ppmarit5 [character]
1. Divorced
2. Never married
3. Now Married
4. Separated
5. Widowed
43 ( 8.2% )
111 ( 21.1% )
337 ( 63.9% )
8 ( 1.5% )
28 ( 5.3% )
0 (0.0%)
ppmsacat [character]
1. Metro area
2. Non-metro area
448 ( 85.0% )
79 ( 15.0% )
0 (0.0%)
ppreg4 [character]
1. MidWest
2. NorthEast
3. South
4. West
118 ( 22.4% )
93 ( 17.6% )
190 ( 36.1% )
126 ( 23.9% )
0 (0.0%)
pprent [character]
1. Occupied without payment
2. Owned or being bought by
3. Rented for cash
10 ( 1.9% )
406 ( 77.0% )
111 ( 21.1% )
0 (0.0%)
ppstaten [character]
1. California
2. Texas
3. Florida
4. Pennsylvania
5. Illinois
6. New Jersey
7. Ohio
8. Michigan
9. New York
10. Washington
[ 39 others ]
51 ( 9.7% )
42 ( 8.0% )
34 ( 6.5% )
28 ( 5.3% )
23 ( 4.4% )
21 ( 4.0% )
21 ( 4.0% )
18 ( 3.4% )
18 ( 3.4% )
18 ( 3.4% )
253 ( 48.0% )
0 (0.0%)
PPWORKA [character]
1. Currently laid off
2. Employed full-time (by so
3. Employed part-time (by so
4. Full Time Student
5. Homemaker
6. On furlough
7. Other
8. Retired
9. Self-employed
13 ( 2.5% )
220 ( 41.7% )
31 ( 5.9% )
8 ( 1.5% )
37 ( 7.0% )
1 ( 0.2% )
20 ( 3.8% )
165 ( 31.3% )
32 ( 6.1% )
0 (0.0%)
ppemploy [character]
1. Not working
2. Working full-time
3. Working part-time
221 ( 41.9% )
245 ( 46.5% )
61 ( 11.6% )
0 (0.0%)
Q1_a [character]
1. Approve
2. Disapprove
3. Skipped
329 ( 62.4% )
193 ( 36.6% )
5 ( 0.9% )
0 (0.0%)
Q1_b [character]
1. Approve
2. Disapprove
3. Skipped
192 ( 36.4% )
322 ( 61.1% )
13 ( 2.5% )
0 (0.0%)
Q1_c [character]
1. Approve
2. Disapprove
3. Skipped
272 ( 51.6% )
248 ( 47.1% )
7 ( 1.3% )
0 (0.0%)
Q1_d [character]
1. Approve
2. Disapprove
3. Skipped
192 ( 36.4% )
321 ( 60.9% )
14 ( 2.7% )
0 (0.0%)
Q1_e [character]
1. Approve
2. Disapprove
3. Skipped
212 ( 40.2% )
301 ( 57.1% )
14 ( 2.7% )
0 (0.0%)
Q1_f [character]
1. Approve
2. Disapprove
3. Skipped
281 ( 53.3% )
230 ( 43.6% )
16 ( 3.0% )
0 (0.0%)
Q2 [character]
1. Not concerned at all
2. Not so concerned
3. Somewhat concerned
4. Very concerned
65 ( 12.3% )
147 ( 27.9% )
221 ( 41.9% )
94 ( 17.8% )
0 (0.0%)
Q3 [character]
1. No
2. Skipped
3. Yes
107 ( 20.3% )
5 ( 0.9% )
415 ( 78.7% )
0 (0.0%)
Q4 [character]
1. Excellent
2. Good
3. Not so good
4. Poor
5. Skipped
60 ( 11.4% )
215 ( 40.8% )
97 ( 18.4% )
149 ( 28.3% )
6 ( 1.1% )
0 (0.0%)
Q5 [character]
1. Optimistic
2. Pessimistic
3. Skipped
229 ( 43.5% )
295 ( 56.0% )
3 ( 0.6% )
0 (0.0%)
QPID [character]
1. A Democrat
2. A Republican
3. An Independent
4. Skipped
5. Something else
176 ( 33.4% )
152 ( 28.8% )
168 ( 31.9% )
3 ( 0.6% )
28 ( 5.3% )
0 (0.0%)
ABCAGE [character]
1. 18-29
2. 30-49
3. 50-64
4. 65+
60 ( 11.4% )
148 ( 28.1% )
157 ( 29.8% )
162 ( 30.7% )
0 (0.0%)
Contact [character]
1. No, I am not willing to b
2. Yes, I am willing to be i
355 ( 67.4% )
172 ( 32.6% )
0 (0.0%)
weights_pid [numeric]
Mean (sd) : 1 (0.6)
min ≤ med ≤ max:
0.3 ≤ 0.8 ≤ 6.3
IQR (CV) : 0.5 (0.6)
453 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-03-26

Any additional comments?

Identify variables that need to be mutated

To analyze or visualize the dataset, some of the string variables may need to be modified. For example, the party identification variable uses non-standard language, such as “A Democrat.” Additionally, the “skipped” response category should be treated as missing data. To address these issues, the variables may need to be recoded or modified for consistency and accuracy.

Code
#starting point
table(abc_poll_orig$QPID)

    A Democrat   A Republican An Independent        Skipped Something else 
           176            152            168              3             28 
Code
#mutate
abc_poll<-abc_poll_orig%>%
  mutate(partyid = str_remove(QPID, "A[n]* "),
         partyid = na_if(partyid, "Skipped"))%>%
  select(-QPID)

#sanity check
table(abc_poll$partyid)

      Democrat    Independent     Republican Something else 
           176            168            152             28 

Ethnic Identity

The ethnic identity variable in the dataset is lengthy and may be difficult to include in graphs or visualizations. However, it may be possible to modify the variable to make it more manageable, potentially by collapsing categories or creating new variables based on the original data. To ensure clarity and accuracy, it would be important to include a table note or other explanatory information that clarifies the meaning of the data labels, such as indicating that racial labels refer to non-Hispanic individuals, and that Hispanic responses do not necessarily indicate a specific race. By providing this contextual information, we can help to ensure that the dataset is accurately interpreted and understood by users.

Code
#starting point
table(abc_poll$ppethm)

2+ Races, Non-Hispanic    Black, Non-Hispanic               Hispanic 
                    21                     27                     51 
   Other, Non-Hispanic    White, Non-Hispanic 
                    24                    404 
Code
#mutate
abc_poll<-abc_poll%>%
  mutate(ethnic = str_remove(ppethm, ", Non-Hispanic"))%>%
  select(-ppethm)

#sanity check
table(abc_poll$ethnic)

2+ Races    Black Hispanic    Other    White 
      21       27       51       24      404 

Remove skipped

The political variables in the dataset all contain a “Skipped” value that should be replaced with “NA” for analysis. To facilitate this process, the “across” function can be used, which allows us to apply a function to multiple columns of a data frame at once. By using “across” to replace “Skipped” values with “NA” across all political variables, we can streamline the data cleaning process and ensure that the resulting dataset is suitable for analysis.

Code
abc_poll<-abc_poll%>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))

map(select(abc_poll, starts_with("Q1")), table)
$Q1_a

   Approve Disapprove 
       329        193 

$Q1_b

   Approve Disapprove 
       192        322 

$Q1_c

   Approve Disapprove 
       272        248 

$Q1_d

   Approve Disapprove 
       192        321 

$Q1_e

   Approve Disapprove 
       212        301 

$Q1_f

   Approve Disapprove 
       281        230 

Factor Order

If we want a variable to appear in a specific order, we can re-level the variable by assigning new levels to the categories in the desired order. For example, the education variable in the dataset is currently arranged in alphabetical order, but we may want to re-level it so that the categories appear in a different order, such as by level of education attainment. By re-leveling the variable in this way, we can ensure that it appears in the desired order in any subsequent analyses or visualizations

Code
table(abc_poll$ppeducat)

Bachelors degree or higher                High school 
                       207                        133 
     Less than high school               Some college 
                        29                        158 
Code
edulabs <- unique(abc_poll$ppeducat)
edulabs
[1] "High school"                "Bachelors degree or higher"
[3] "Some college"               "Less than high school"     
Code
abc_poll<-abc_poll%>%
  mutate(educ = factor(ppeducat, 
                       levels=edulabs[c(4,1,3,2)]))%>%
  select(-ppeducat)
rm(edulabs)

table(abc_poll$educ)

     Less than high school                High school 
                        29                        133 
              Some college Bachelors degree or higher 
                       158                        207