Challenge 4

challenge_4

abc_poll

eggs

fed_rates

hotel_bookings

debt

Author

Thrishul

Published

August 18, 2022

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
identify variables that need to be mutated
mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

abc_poll.csv ⭐
poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
FedFundsRate.csv⭐⭐⭐
hotel_bookings.csv⭐⭐⭐⭐
debt_in_trillions.xlsx ⭐⭐⭐⭐⭐

Code

abc_poll_orig<-read_csv("_data/abc_poll_2021.csv")

# political questions
abc_poll_orig%>%
  select(starts_with("Q"))%>%
  colnames(.)

 [1] "Q1_a" "Q1_b" "Q1_c" "Q1_d" "Q1_e" "Q1_f" "Q2"   "Q3"   "Q4"   "Q5"  
[11] "QPID"

Code

# all but one demographer
abc_poll_orig%>%
  select(starts_with("pp"))%>%
  colnames(.)

 [1] "ppage"    "ppeduc5"  "ppeducat" "ppgender" "ppethm"   "pphhsize"
 [7] "ppinc7"   "ppmarit5" "ppmsacat" "ppreg4"   "pprent"   "ppstaten"
[13] "PPWORKA"  "ppemploy"

Code

# national poll
n_distinct(abc_poll_orig$ppstaten)

[1] 49

Briefly describe the data

The ABC Poll dataset is likely a national sample survey, conducted in 2019, that includes responses from 527 individuals. The survey covers a range of topics, including 10 questions related to political attitudes and beliefs, as well as party identification. Additionally, there are 15 demographic variables included in the dataset, some of which have been recoded to facilitate analysis. Finally, the dataset also includes 5 survey administration variables that provide information on the methodology and logistics of the survey administration. Overall, the dataset is a comprehensive collection of information on political attitudes and demographics in the surveyed population, and it offers a valuable resource for researchers and analysts interested in understanding these topics

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

Code

print(summarytools::dfSummary(abc_poll_orig,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

abc_poll_orig

Dimensions: 527 x 31
Duplicates: 0

id [numeric]

Mean (sd) : 7230264 (152.3)
min ≤ med ≤ max:
7230001 ≤ 7230264 ≤ 7230527
IQR (CV) : 263 (0)

527 distinct values

0 (0.0%)

xspanish [character]

1. English
2. Spanish

514	(	97.5%	)
13	(	2.5%	)

0 (0.0%)

complete_status [character]

1. qualified

527	(	100.0%	)

0 (0.0%)

ppage [numeric]

Mean (sd) : 53.4 (17.1)
min ≤ med ≤ max:
18 ≤ 55 ≤ 91
IQR (CV) : 27 (0.3)

72 distinct values

0 (0.0%)

ppeduc5 [character]

1. NA
2. NA
3. High school graduate (hig
4. No high school diploma or
5. Some college or Associate

99	(	18.8%	)
108	(	20.5%	)
133	(	25.2%	)
29	(	5.5%	)
158	(	30.0%	)

0 (0.0%)

ppeducat [character]

1. Bachelors degree or highe
2. High school
3. Less than high school
4. Some college

207	(	39.3%	)
133	(	25.2%	)
29	(	5.5%	)
158	(	30.0%	)

0 (0.0%)

ppgender [character]

1. Female
2. Male

254	(	48.2%	)
273	(	51.8%	)

0 (0.0%)

ppethm [character]

1. 2+ Races, Non-Hispanic
2. Black, Non-Hispanic
3. Hispanic
4. Other, Non-Hispanic
5. White, Non-Hispanic

21	(	4.0%	)
27	(	5.1%	)
51	(	9.7%	)
24	(	4.6%	)
404	(	76.7%	)

0 (0.0%)

pphhsize [character]

1. 1
2. 2
3. 3
4. 4
5. 5
6. 6 or more

80	(	15.2%	)
219	(	41.6%	)
102	(	19.4%	)
76	(	14.4%	)
35	(	6.6%	)
15	(	2.8%	)

0 (0.0%)

ppinc7 [character]

1. $10,000 to $24,999
2. $100,000 to $149,999
3. $150,000 or more
4. $25,000 to $49,999
5. $50,000 to $74,999
6. $75,000 to $99,999
7. Less than $10,000

32	(	6.1%	)
105	(	19.9%	)
137	(	26.0%	)
82	(	15.6%	)
85	(	16.1%	)
69	(	13.1%	)
17	(	3.2%	)

0 (0.0%)

ppmarit5 [character]

1. Divorced
2. Never married
3. Now Married
4. Separated
5. Widowed

43	(	8.2%	)
111	(	21.1%	)
337	(	63.9%	)
8	(	1.5%	)
28	(	5.3%	)

0 (0.0%)

ppmsacat [character]

1. Metro area
2. Non-metro area

448	(	85.0%	)
79	(	15.0%	)

0 (0.0%)

ppreg4 [character]

1. MidWest
2. NorthEast
3. South
4. West

118	(	22.4%	)
93	(	17.6%	)
190	(	36.1%	)
126	(	23.9%	)

0 (0.0%)

pprent [character]

1. Occupied without payment
2. Owned or being bought by
3. Rented for cash

10	(	1.9%	)
406	(	77.0%	)
111	(	21.1%	)

0 (0.0%)

ppstaten [character]

1. California
2. Texas
3. Florida
4. Pennsylvania
5. Illinois
6. New Jersey
7. Ohio
8. Michigan
9. New York
10. Washington
[ 39 others ]

51	(	9.7%	)
42	(	8.0%	)
34	(	6.5%	)
28	(	5.3%	)
23	(	4.4%	)
21	(	4.0%	)
21	(	4.0%	)
18	(	3.4%	)
18	(	3.4%	)
18	(	3.4%	)
253	(	48.0%	)

0 (0.0%)

PPWORKA [character]

1. Currently laid off
2. Employed full-time (by so
3. Employed part-time (by so
4. Full Time Student
5. Homemaker
6. On furlough
7. Other
8. Retired
9. Self-employed

13	(	2.5%	)
220	(	41.7%	)
31	(	5.9%	)
8	(	1.5%	)
37	(	7.0%	)
1	(	0.2%	)
20	(	3.8%	)
165	(	31.3%	)
32	(	6.1%	)

0 (0.0%)

ppemploy [character]

1. Not working
2. Working full-time
3. Working part-time

221	(	41.9%	)
245	(	46.5%	)
61	(	11.6%	)

0 (0.0%)

Q1_a [character]

1. Approve
2. Disapprove
3. Skipped

329	(	62.4%	)
193	(	36.6%	)
5	(	0.9%	)

0 (0.0%)

Q1_b [character]

1. Approve
2. Disapprove
3. Skipped

192	(	36.4%	)
322	(	61.1%	)
13	(	2.5%	)

0 (0.0%)

Q1_c [character]

1. Approve
2. Disapprove
3. Skipped

272	(	51.6%	)
248	(	47.1%	)
7	(	1.3%	)

0 (0.0%)

Q1_d [character]

1. Approve
2. Disapprove
3. Skipped

192	(	36.4%	)
321	(	60.9%	)
14	(	2.7%	)

0 (0.0%)

Q1_e [character]

1. Approve
2. Disapprove
3. Skipped

212	(	40.2%	)
301	(	57.1%	)
14	(	2.7%	)

0 (0.0%)

Q1_f [character]

1. Approve
2. Disapprove
3. Skipped

281	(	53.3%	)
230	(	43.6%	)
16	(	3.0%	)

0 (0.0%)

Q2 [character]

1. Not concerned at all
2. Not so concerned
3. Somewhat concerned
4. Very concerned

65	(	12.3%	)
147	(	27.9%	)
221	(	41.9%	)
94	(	17.8%	)

0 (0.0%)

Q3 [character]

1. No
2. Skipped
3. Yes

107	(	20.3%	)
5	(	0.9%	)
415	(	78.7%	)

0 (0.0%)

Q4 [character]

1. Excellent
2. Good
3. Not so good
4. Poor
5. Skipped

60	(	11.4%	)
215	(	40.8%	)
97	(	18.4%	)
149	(	28.3%	)
6	(	1.1%	)

0 (0.0%)

Q5 [character]

1. Optimistic
2. Pessimistic
3. Skipped

229	(	43.5%	)
295	(	56.0%	)
3	(	0.6%	)

0 (0.0%)

QPID [character]

1. A Democrat
2. A Republican
3. An Independent
4. Skipped
5. Something else

176	(	33.4%	)
152	(	28.8%	)
168	(	31.9%	)
3	(	0.6%	)
28	(	5.3%	)

0 (0.0%)

ABCAGE [character]

1. 18-29
2. 30-49
3. 50-64
4. 65+

60	(	11.4%	)
148	(	28.1%	)
157	(	29.8%	)
162	(	30.7%	)

0 (0.0%)

Contact [character]

1. No, I am not willing to b
2. Yes, I am willing to be i

355	(	67.4%	)
172	(	32.6%	)

0 (0.0%)

weights_pid [numeric]

Mean (sd) : 1 (0.6)
min ≤ med ≤ max:
0.3 ≤ 0.8 ≤ 6.3
IQR (CV) : 0.5 (0.6)

453 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-03-26

Any additional comments?

Identify variables that need to be mutated

To analyze or visualize the dataset, some of the string variables may need to be modified. For example, the party identification variable uses non-standard language, such as “A Democrat.” Additionally, the “skipped” response category should be treated as missing data. To address these issues, the variables may need to be recoded or modified for consistency and accuracy.

Code

#starting point
table(abc_poll_orig$QPID)


    A Democrat   A Republican An Independent        Skipped Something else 
           176            152            168              3             28

Code

#mutate
abc_poll<-abc_poll_orig%>%
  mutate(partyid = str_remove(QPID, "A[n]* "),
         partyid = na_if(partyid, "Skipped"))%>%
  select(-QPID)

#sanity check
table(abc_poll$partyid)


      Democrat    Independent     Republican Something else 
           176            168            152             28

Ethnic Identity

The ethnic identity variable in the dataset is lengthy and may be difficult to include in graphs or visualizations. However, it may be possible to modify the variable to make it more manageable, potentially by collapsing categories or creating new variables based on the original data. To ensure clarity and accuracy, it would be important to include a table note or other explanatory information that clarifies the meaning of the data labels, such as indicating that racial labels refer to non-Hispanic individuals, and that Hispanic responses do not necessarily indicate a specific race. By providing this contextual information, we can help to ensure that the dataset is accurately interpreted and understood by users.

Code

#starting point
table(abc_poll$ppethm)


2+ Races, Non-Hispanic    Black, Non-Hispanic               Hispanic 
                    21                     27                     51 
   Other, Non-Hispanic    White, Non-Hispanic 
                    24                    404

Code

#mutate
abc_poll<-abc_poll%>%
  mutate(ethnic = str_remove(ppethm, ", Non-Hispanic"))%>%
  select(-ppethm)

#sanity check
table(abc_poll$ethnic)


2+ Races    Black Hispanic    Other    White 
      21       27       51       24      404

Remove skipped

The political variables in the dataset all contain a “Skipped” value that should be replaced with “NA” for analysis. To facilitate this process, the “across” function can be used, which allows us to apply a function to multiple columns of a data frame at once. By using “across” to replace “Skipped” values with “NA” across all political variables, we can streamline the data cleaning process and ensure that the resulting dataset is suitable for analysis.

Code

abc_poll<-abc_poll%>%
  mutate(across(starts_with("Q"), ~ na_if(.x, "Skipped")))

map(select(abc_poll, starts_with("Q1")), table)

$Q1_a

   Approve Disapprove 
       329        193 

$Q1_b

   Approve Disapprove 
       192        322 

$Q1_c

   Approve Disapprove 
       272        248 

$Q1_d

   Approve Disapprove 
       192        321 

$Q1_e

   Approve Disapprove 
       212        301 

$Q1_f

   Approve Disapprove 
       281        230

Factor Order

If we want a variable to appear in a specific order, we can re-level the variable by assigning new levels to the categories in the desired order. For example, the education variable in the dataset is currently arranged in alphabetical order, but we may want to re-level it so that the categories appear in a different order, such as by level of education attainment. By re-leveling the variable in this way, we can ensure that it appears in the desired order in any subsequent analyses or visualizations

Code

table(abc_poll$ppeducat)


Bachelors degree or higher                High school 
                       207                        133 
     Less than high school               Some college 
                        29                        158

Code

edulabs <- unique(abc_poll$ppeducat)
edulabs

[1] "High school"                "Bachelors degree or higher"
[3] "Some college"               "Less than high school"

Code

abc_poll<-abc_poll%>%
  mutate(educ = factor(ppeducat, 
                       levels=edulabs[c(4,1,3,2)]))%>%
  select(-ppeducat)
rm(edulabs)

table(abc_poll$educ)


     Less than high school                High school 
                        29                        133 
              Some college Bachelors degree or higher 
                       158                        207