Challenge 4

Author

Michaela Bowen

Published

October 6, 2022

Code

library(tidyverse)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

In today’s Challenge I have attempted to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
identify variables that need to be mutated
mutate variables and sanity check all mutations

Read in data

Today I am reading in:

abc_poll_2021.csv ⭐

Code

library(readr)
abc_poll <- read_csv("_data/abc_poll_2021.csv")

Briefly describe the data

Code

library(summarytools)
##dataframe summary
print(summarytools::dfSummary(abc_poll,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

abc_poll

Dimensions: 527 x 31
Duplicates: 0

Variable | Stats / Values | Freqs (% of Valid) | Missing
id [numeric] | Mean (sd): 7230264 (152.3); min ≤ med ≤ max: 7230001 ≤ 7230264 ≤ 7230527; IQR (CV): 263 (0) | 527 distinct values | 0 (0.0%)
xspanish [character] | 1. English; 2. Spanish | 514 (97.5%); 13 (2.5%) | 0 (0.0%)
complete_status [character] | 1. qualified | 527 (100.0%) | 0 (0.0%)
ppage [numeric] | Mean (sd): 53.4 (17.1); min ≤ med ≤ max: 18 ≤ 55 ≤ 91; IQR (CV): 27 (0.3) | 72 distinct values | 0 (0.0%)
ppeduc5 [character] | 1. NA; 2. NA; 3. High school graduate (hig; 4. No high school diploma or; 5. Some college or Associate | 99 (18.8%); 108 (20.5%); 133 (25.2%); 29 (5.5%); 158 (30.0%) | 0 (0.0%)
ppeducat [character] | 1. Bachelors degree or highe; 2. High school; 3. Less than high school; 4. Some college | 207 (39.3%); 133 (25.2%); 29 (5.5%); 158 (30.0%) | 0 (0.0%)
ppgender [character] | 1. Female; 2. Male | 254 (48.2%); 273 (51.8%) | 0 (0.0%)
ppethm [character] | 1. 2+ Races, Non-Hispanic; 2. Black, Non-Hispanic; 3. Hispanic; 4. Other, Non-Hispanic; 5. White, Non-Hispanic | 21 (4.0%); 27 (5.1%); 51 (9.7%); 24 (4.6%); 404 (76.7%) | 0 (0.0%)
pphhsize [character] | 1. 1; 2. 2; 3. 3; 4. 4; 5. 5; 6. 6 or more | 80 (15.2%); 219 (41.6%); 102 (19.4%); 76 (14.4%); 35 (6.6%); 15 (2.8%) | 0 (0.0%)
ppinc7 [character] | 1. $10,000 to $24,999; 2. $100,000 to $149,999; 3. $150,000 or more; 4. $25,000 to $49,999; 5. $50,000 to $74,999; 6. $75,000 to $99,999; 7. Less than $10,000 | 32 (6.1%); 105 (19.9%); 137 (26.0%); 82 (15.6%); 85 (16.1%); 69 (13.1%); 17 (3.2%) | 0 (0.0%)
ppmarit5 [character] | 1. Divorced; 2. Never married; 3. Now Married; 4. Separated; 5. Widowed | 43 (8.2%); 111 (21.1%); 337 (63.9%); 8 (1.5%); 28 (5.3%) | 0 (0.0%)
ppmsacat [character] | 1. Metro area; 2. Non-metro area | 448 (85.0%); 79 (15.0%) | 0 (0.0%)
ppreg4 [character] | 1. MidWest; 2. NorthEast; 3. South; 4. West | 118 (22.4%); 93 (17.6%); 190 (36.1%); 126 (23.9%) | 0 (0.0%)
pprent [character] | 1. Occupied without payment; 2. Owned or being bought by; 3. Rented for cash | 10 (1.9%); 406 (77.0%); 111 (21.1%) | 0 (0.0%)
ppstaten [character] | 1. California; 2. Texas; 3. Florida; 4. Pennsylvania; 5. Illinois; 6. New Jersey; 7. Ohio; 8. Michigan; 9. New York; 10. Washington; [ 39 others ] | 51 (9.7%); 42 (8.0%); 34 (6.5%); 28 (5.3%); 23 (4.4%); 21 (4.0%); 21 (4.0%); 18 (3.4%); 18 (3.4%); 18 (3.4%); 253 (48.0%) | 0 (0.0%)
PPWORKA [character] | 1. Currently laid off; 2. Employed full-time (by so; 3. Employed part-time (by so; 4. Full Time Student; 5. Homemaker; 6. On furlough; 7. Other; 8. Retired; 9. Self-employed | 13 (2.5%); 220 (41.7%); 31 (5.9%); 8 (1.5%); 37 (7.0%); 1 (0.2%); 20 (3.8%); 165 (31.3%); 32 (6.1%) | 0 (0.0%)
ppemploy [character] | 1. Not working; 2. Working full-time; 3. Working part-time | 221 (41.9%); 245 (46.5%); 61 (11.6%) | 0 (0.0%)
Q1_a [character] | 1. Approve; 2. Disapprove; 3. Skipped | 329 (62.4%); 193 (36.6%); 5 (0.9%) | 0 (0.0%)
Q1_b [character] | 1. Approve; 2. Disapprove; 3. Skipped | 192 (36.4%); 322 (61.1%); 13 (2.5%) | 0 (0.0%)
Q1_c [character] | 1. Approve; 2. Disapprove; 3. Skipped | 272 (51.6%); 248 (47.1%); 7 (1.3%) | 0 (0.0%)
Q1_d [character] | 1. Approve; 2. Disapprove; 3. Skipped | 192 (36.4%); 321 (60.9%); 14 (2.7%) | 0 (0.0%)
Q1_e [character] | 1. Approve; 2. Disapprove; 3. Skipped | 212 (40.2%); 301 (57.1%); 14 (2.7%) | 0 (0.0%)
Q1_f [character] | 1. Approve; 2. Disapprove; 3. Skipped | 281 (53.3%); 230 (43.6%); 16 (3.0%) | 0 (0.0%)
Q2 [character] | 1. Not concerned at all; 2. Not so concerned; 3. Somewhat concerned; 4. Very concerned | 65 (12.3%); 147 (27.9%); 221 (41.9%); 94 (17.8%) | 0 (0.0%)
Q3 [character] | 1. No; 2. Skipped; 3. Yes | 107 (20.3%); 5 (0.9%); 415 (78.7%) | 0 (0.0%)
Q4 [character] | 1. Excellent; 2. Good; 3. Not so good; 4. Poor; 5. Skipped | 60 (11.4%); 215 (40.8%); 97 (18.4%); 149 (28.3%); 6 (1.1%) | 0 (0.0%)
Q5 [character] | 1. Optimistic; 2. Pessimistic; 3. Skipped | 229 (43.5%); 295 (56.0%); 3 (0.6%) | 0 (0.0%)
QPID [character] | 1. A Democrat; 2. A Republican; 3. An Independent; 4. Skipped; 5. Something else | 176 (33.4%); 152 (28.8%); 168 (31.9%); 3 (0.6%); 28 (5.3%) | 0 (0.0%)
ABCAGE [character] | 1. 18-29; 2. 30-49; 3. 50-64; 4. 65+ | 60 (11.4%); 148 (28.1%); 157 (29.8%); 162 (30.7%) | 0 (0.0%)
Contact [character] | 1. No, I am not willing to b; 2. Yes, I am willing to be i | 355 (67.4%); 172 (32.6%) | 0 (0.0%)
weights_pid [numeric] | Mean (sd): 1 (0.6); min ≤ med ≤ max: 0.3 ≤ 0.8 ≤ 6.3; IQR (CV): 0.5 (0.6) | 453 distinct values | 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-11-26

Code

##Identify Participant Demographic Variables
abc_poll%>%
  select(starts_with("pp"))%>%
  colnames()

 [1] "ppage"    "ppeduc5"  "ppeducat" "ppgender" "ppethm"   "pphhsize"
 [7] "ppinc7"   "ppmarit5" "ppmsacat" "ppreg4"   "pprent"   "ppstaten"
[13] "PPWORKA"  "ppemploy"

Code

##Identify Political Question Variables
abc_poll%>%
  select(starts_with("Q"))%>%
  colnames()

 [1] "Q1_a" "Q1_b" "Q1_c" "Q1_d" "Q1_e" "Q1_f" "Q2"   "Q3"   "Q4"   "Q5"  
[11] "QPID"

The ABC Poll data is survey data from an ABC News political poll. The data appears to be from 2021. After investigating the data we can see a few things:

There are demographic variables, such as education level, household size, and income level, among other identifying information. We can also see that each of the 527 respondents has a unique “id”, so there are no duplicate respondents.
There are 10 questions that ask about political sentiments (Q1_a through Q1_f, and Q2 through Q5), as well as one asking about political affiliation (QPID).
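
The two observations above can be checked directly rather than read off the summary. The following is a minimal sketch on a made-up stand-in for `abc_poll` (the `toy_poll` tibble and its values are illustrative assumptions, not the real data): it confirms that ids are unique and counts the question columns whose names start with "Q" followed by a digit, which excludes QPID.

```r
library(dplyr)

# Hypothetical toy stand-in for abc_poll; the real data has 527 rows and 31 columns
toy_poll <- tibble(
  id   = c(7230001, 7230002, 7230003),
  Q1_a = c("Approve", "Disapprove", "Skipped"),
  Q2   = c("Very concerned", "Not so concerned", "Somewhat concerned"),
  QPID = c("A Democrat", "A Republican", "An Independent")
)

# TRUE when every respondent id appears exactly once
ids_unique <- n_distinct(toy_poll$id) == nrow(toy_poll)

# Count sentiment questions: "Q" followed by a digit, so QPID is not counted
n_questions <- toy_poll %>%
  select(matches("^Q[0-9]")) %>%
  ncol()

ids_unique
n_questions
```

Run against the real `abc_poll`, the same two expressions would be expected to agree with the 527 distinct ids and the 10 sentiment questions described above.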

Tidy Data

There are a few ways I would like to tidy this data while reading it in:

Rename columns: I will rename the “id”, “complete_status”, and “ppeducat” (redundant information) columns to “delete” so that those columns can be dropped in a single select() call. The other demographic columns will be renamed according to the information they contain.
Here is how I will mutate the following columns:

Ethnicity: every response except “Hispanic” contains “Non-Hispanic”, which is redundant and lengthy, so I will strip it from each response.
Political identity: responses will be shortened from “A Democrat”, “A Republican”, and “An Independent” to the party name alone. The “Skipped” response will also be changed to NA so that it signifies missing data rather than counting as a response.
Education: the apostrophe in the “Bachelor’s” and “Master’s” responses must be removed, because it causes those values to show up as “NA” in the summary where we don’t want that.
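
To make the ethnicity and political identity mutations concrete, here is a small sketch on invented values (the `toy` tibble and its column names are assumptions for illustration only). It anchors each pattern so that only the intended prefix or suffix is removed, then converts "Skipped" to NA.

```r
library(dplyr)
library(stringr)

# Hypothetical example responses mirroring the categories seen in the summary
toy <- tibble(
  ethnicity = c("White, Non-Hispanic", "Hispanic", "2+ Races, Non-Hispanic"),
  party     = c("A Democrat", "An Independent", "Skipped")
)

toy_clean <- toy %>%
  mutate(
    # strip the trailing ", Non-Hispanic"; "Hispanic" is left untouched
    ethnicity = str_remove(ethnicity, ", Non-Hispanic$"),
    # "^An? " matches a leading "A " or "An ", including the space
    party = str_remove(party, "^An? "),
    # treat a skipped question as missing data
    party = na_if(party, "Skipped")
  )

toy_clean
```

Anchoring with `^` and `$` is a design choice: it avoids touching a capital "A" that happens to appear elsewhere in a response.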

Code

abc_poll_tidy <- read_csv("_data/abc_poll_2021.csv",
                          skip = 1,
                          col_names = c("delete", "language", "delete", "pp_age",
                                        "pp_education", "delete", "pp_gender",
                                        "pp_ethnicity", "pp_house_size", "pp_incomelvl",
                                        "pp_marital_status", "pp_metrocat", "pp_region",
                                        "pp_rent", "pp_state", "pp_working_status",
                                        "pp_employment", "q_1_a", "q_1_b", "q_1_c",
                                        "q_1_d", "q_1_e", "q_1_f", "q_2", "q_3", "q_4",
                                        "q_5", "pp_partyid", "pp_age_range",
                                        "pp_can_contact", "weights"))%>%
  #dropping every column whose name contains "delete"
  select(!contains("delete"))%>%
  #removing redundant "Non-Hispanic" identification from ethnicity column
  mutate(pp_ethnicity = str_remove(pp_ethnicity, ", Non-Hispanic"))%>%
  #mutating political identity column to remove the leading "A " or "An "
  #(anchored pattern so the space is removed too), then setting "Skipped" to NA
  mutate(
    pp_partyid = str_remove(pp_partyid, "^An? "),
    pp_partyid = na_if(pp_partyid, "Skipped")
  )%>%
  #removing the apostrophe byte from the Bachelor's and Master's responses
  mutate(
    pp_education = str_remove(pp_education, "\x92")
  )

#Sanity Checks
prop.table(table(abc_poll_tidy$pp_partyid))


      Democrat    Independent     Republican Something else 
    0.33587786     0.32061069     0.29007634     0.05343511
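
Note that `table()` drops NA values by default, so the shares above are proportions of the non-missing responses only. A further sanity check, sketched here on invented values (the `before` vector is an assumption, not the real column), is to confirm that the number of NAs after the mutation equals the number of "Skipped" responses before it, and to include the missing share explicitly with `useNA`.

```r
library(dplyr)

# Hypothetical raw responses, before cleaning
before <- c("A Democrat", "Skipped", "An Independent", "Skipped", "A Republican")

# Same cleaning steps as the mutation: drop the article, then "Skipped" -> NA
after <- na_if(sub("^An? ", "", before), "Skipped")

# The count of missing values should match the count of skipped responses
n_skipped <- sum(before == "Skipped")
n_missing <- sum(is.na(after))

# useNA = "ifany" keeps NA as its own category in the proportions
shares <- prop.table(table(after, useNA = "ifany"))

n_skipped == n_missing
shares
```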

Code

prop.table(table(abc_poll_tidy$pp_ethnicity))


  2+ Races      Black   Hispanic      Other      White 
0.03984820 0.05123340 0.09677419 0.04554080 0.76660342

Code

prop.table(table(abc_poll_tidy$pp_education))


                                                Bachelors degree 
                                                      0.20493359 
High school graduate (high school diploma or the equivalent GED) 
                                                      0.25237192 
                                         Masters degree or above 
                                                      0.18785579 
                                   No high school diploma or GED 
                                                      0.05502846 
                                Some college or Associate degree 
                                                      0.29981025
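
The education labels now display correctly because the stray apostrophe byte was removed. As a possible alternative, and assuming the file is encoded as Windows-1252 (which the `\x92` byte suggests but the source does not confirm), the text could be converted to UTF-8 first and the apostrophe removed afterwards. The sketch below builds a one-off example string by hand; it does not read the real file.

```r
# Build a label containing the single byte 0x92, which is the right single
# quotation mark in Windows-1252 (assumed encoding of the source file)
raw_label <- rawToChar(c(charToRaw("Bachelor"), as.raw(0x92), charToRaw("s degree")))

# Convert to UTF-8 so the apostrophe becomes the valid character U+2019
utf8_label <- iconv(raw_label, from = "windows-1252", to = "UTF-8")

# Remove the typographic apostrophe now that the string is valid UTF-8
clean_label <- gsub("\u2019", "", utf8_label, fixed = TRUE)

clean_label
```

If that assumption holds, passing the encoding to `read_csv()` through its `locale` argument may achieve the same result at read time.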

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

For a more streamlined analysis, several columns must be mutated. pp_incomelvl: change the dollar-range labels into numbered categories, giving each income bracket its own level.


Code

abc_poll_tidy <- abc_poll_tidy%>%
  #categorizing income level
  mutate(pp_incomelvl = case_when(
    pp_incomelvl == "Less than $10,000" ~ "Income Level 1",
    pp_incomelvl == "$10,000 to $24,999" ~ "Income Level 2",
    pp_incomelvl == "$25,000 to $49,999" ~ "Income Level 3",
    pp_incomelvl == "$50,000 to $74,999" ~ "Income Level 4",
    pp_incomelvl == "$75,000 to $99,999" ~ "Income Level 5",
    pp_incomelvl == "$100,000 to $149,999" ~ "Income Level 6",
    pp_incomelvl == "$150,000 or more" ~ "Income Level 7"
  ))
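
Because the recoded labels are still plain character strings, they carry no inherent order. One option, sketched below on a made-up vector (`toy_income` is an assumption for illustration), is to convert them to an ordered factor so that sorting and plotting follow the bracket order.

```r
# The seven labels produced by the case_when() recode, lowest bracket first
income_levels <- paste("Income Level", 1:7)

# Hypothetical recoded values
toy_income <- c("Income Level 3", "Income Level 7", "Income Level 1")

# An ordered factor records the bracket order explicitly
income_fct <- factor(toy_income, levels = income_levels, ordered = TRUE)

# Sorting now follows the declared level order
lowest_first <- sort(income_fct)

lowest_first
```

In the pipeline this would amount to one more `mutate()` step on `pp_incomelvl`. Also note that `case_when()` returns NA for any value that matches none of its conditions, so an NA count after the recode is a useful check.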

Any additional comments? Here is the final data frame:

Code

print(summarytools::dfSummary(abc_poll_tidy,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')