DACSS 601: Data Science Fundamentals - FALL 2022
Challenge-4

challenge_4
abc_poll
Author

Said Arslan

Published

October 12, 2022

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'stringr' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(summarytools)

Attaching package: 'summarytools'

The following object is masked from 'package:tibble':

    view
Code
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read Data

Code
abc.poll <- read_csv("_data/abc_poll_2021.csv")
head(abc.poll)
# A tibble: 6 × 31
       id xspanish complet…¹ ppage ppeduc5 ppedu…² ppgen…³ ppethm pphhs…⁴ ppinc7
    <dbl> <chr>    <chr>     <dbl> <chr>   <chr>   <chr>   <chr>  <chr>   <chr> 
1 7230001 English  qualified    68 "High … High s… Female  White… 2       $25,0…
2 7230002 English  qualified    85 "Bache… Bachel… Male    White… 2       $150,…
3 7230003 English  qualified    69 "High … High s… Male    White… 2       $100,…
4 7230004 English  qualified    74 "Bache… Bachel… Female  White… 1       $25,0…
5 7230005 English  qualified    77 "High … High s… Male    White… 3       $10,0…
6 7230006 English  qualified    70 "Bache… Bachel… Male    White… 2       $75,0…
# … with 21 more variables: ppmarit5 <chr>, ppmsacat <chr>, ppreg4 <chr>,
#   pprent <chr>, ppstaten <chr>, PPWORKA <chr>, ppemploy <chr>, Q1_a <chr>,
#   Q1_b <chr>, Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>, Q2 <chr>,
#   Q3 <chr>, Q4 <chr>, Q5 <chr>, QPID <chr>, ABCAGE <chr>, Contact <chr>,
#   weights_pid <dbl>, and abbreviated variable names ¹​complete_status,
#   ²​ppeducat, ³​ppgender, ⁴​pphhsize
Code
sample_n(abc.poll, 10)
# A tibble: 10 × 31
        id xspanish comple…¹ ppage ppeduc5 ppedu…² ppgen…³ ppethm pphhs…⁴ ppinc7
     <dbl> <chr>    <chr>    <dbl> <chr>   <chr>   <chr>   <chr>  <chr>   <chr> 
 1 7230038 English  qualifi…    44 "Bache… Bachel… Female  Other… 5       $150,…
 2 7230307 English  qualifi…    28 "No hi… Less t… Male    Hispa… 6 or m… $75,0…
 3 7230091 English  qualifi…    63 "Maste… Bachel… Male    Hispa… 2       $150,…
 4 7230087 English  qualifi…    56 "Bache… Bachel… Female  Other… 3       $75,0…
 5 7230228 English  qualifi…    19 "Some … Some c… Female  2+ Ra… 4       $50,0…
 6 7230405 English  qualifi…    71 "Bache… Bachel… Male    White… 2       $50,0…
 7 7230186 English  qualifi…    76 "Maste… Bachel… Female  White… 1       $75,0…
 8 7230164 English  qualifi…    33 "Maste… Bachel… Female  White… 2       $100,…
 9 7230191 English  qualifi…    52 "High … High s… Female  White… 1       $25,0…
10 7230485 English  qualifi…    70 "Maste… Bachel… Female  White… 2       $100,…
# … with 21 more variables: ppmarit5 <chr>, ppmsacat <chr>, ppreg4 <chr>,
#   pprent <chr>, ppstaten <chr>, PPWORKA <chr>, ppemploy <chr>, Q1_a <chr>,
#   Q1_b <chr>, Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>, Q2 <chr>,
#   Q3 <chr>, Q4 <chr>, Q5 <chr>, QPID <chr>, ABCAGE <chr>, Contact <chr>,
#   weights_pid <dbl>, and abbreviated variable names ¹​complete_status,
#   ²​ppeducat, ³​ppgender, ⁴​pphhsize

Briefly describe the data

Code
print(dfSummary(abc.poll, 
                varnumbers= FALSE, 
                plain.ascii= FALSE, 
                style= "grid", 
                graph.magnif= 0.80, 
                valid.col= FALSE),
      method= 'render', 
      table.classes= 'table-condensed')

Data Frame Summary

abc.poll

Dimensions: 527 x 31
Duplicates: 0
| Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|
| id [numeric] | Mean (sd): 7230264 (152.3); min ≤ med ≤ max: 7230001 ≤ 7230264 ≤ 7230527; IQR (CV): 263 (0) | 527 distinct values | 0 (0.0%) |
| xspanish [character] | 1. English; 2. Spanish | 514 (97.5%); 13 (2.5%) | 0 (0.0%) |
| complete_status [character] | 1. qualified | 527 (100.0%) | 0 (0.0%) |
| ppage [numeric] | Mean (sd): 53.4 (17.1); min ≤ med ≤ max: 18 ≤ 55 ≤ 91; IQR (CV): 27 (0.3) | 72 distinct values | 0 (0.0%) |
| ppeduc5 [character] | 1. NA; 2. NA; 3. High school graduate (hig; 4. No high school diploma or; 5. Some college or Associate | 99 (18.8%); 108 (20.5%); 133 (25.2%); 29 (5.5%); 158 (30.0%) | 0 (0.0%) |
| ppeducat [character] | 1. Bachelors degree or highe; 2. High school; 3. Less than high school; 4. Some college | 207 (39.3%); 133 (25.2%); 29 (5.5%); 158 (30.0%) | 0 (0.0%) |
| ppgender [character] | 1. Female; 2. Male | 254 (48.2%); 273 (51.8%) | 0 (0.0%) |
| ppethm [character] | 1. 2+ Races, Non-Hispanic; 2. Black, Non-Hispanic; 3. Hispanic; 4. Other, Non-Hispanic; 5. White, Non-Hispanic | 21 (4.0%); 27 (5.1%); 51 (9.7%); 24 (4.6%); 404 (76.7%) | 0 (0.0%) |
| pphhsize [character] | 1. 1; 2. 2; 3. 3; 4. 4; 5. 5; 6. 6 or more | 80 (15.2%); 219 (41.6%); 102 (19.4%); 76 (14.4%); 35 (6.6%); 15 (2.8%) | 0 (0.0%) |
| ppinc7 [character] | 1. $10,000 to $24,999; 2. $100,000 to $149,999; 3. $150,000 or more; 4. $25,000 to $49,999; 5. $50,000 to $74,999; 6. $75,000 to $99,999; 7. Less than $10,000 | 32 (6.1%); 105 (19.9%); 137 (26.0%); 82 (15.6%); 85 (16.1%); 69 (13.1%); 17 (3.2%) | 0 (0.0%) |
| ppmarit5 [character] | 1. Divorced; 2. Never married; 3. Now Married; 4. Separated; 5. Widowed | 43 (8.2%); 111 (21.1%); 337 (63.9%); 8 (1.5%); 28 (5.3%) | 0 (0.0%) |
| ppmsacat [character] | 1. Metro area; 2. Non-metro area | 448 (85.0%); 79 (15.0%) | 0 (0.0%) |
| ppreg4 [character] | 1. MidWest; 2. NorthEast; 3. South; 4. West | 118 (22.4%); 93 (17.6%); 190 (36.1%); 126 (23.9%) | 0 (0.0%) |
| pprent [character] | 1. Occupied without payment; 2. Owned or being bought by; 3. Rented for cash | 10 (1.9%); 406 (77.0%); 111 (21.1%) | 0 (0.0%) |
| ppstaten [character] | 1. California; 2. Texas; 3. Florida; 4. Pennsylvania; 5. Illinois; 6. New Jersey; 7. Ohio; 8. Michigan; 9. New York; 10. Washington; [ 39 others ] | 51 (9.7%); 42 (8.0%); 34 (6.5%); 28 (5.3%); 23 (4.4%); 21 (4.0%); 21 (4.0%); 18 (3.4%); 18 (3.4%); 18 (3.4%); 253 (48.0%) | 0 (0.0%) |
| PPWORKA [character] | 1. Currently laid off; 2. Employed full-time (by so; 3. Employed part-time (by so; 4. Full Time Student; 5. Homemaker; 6. On furlough; 7. Other; 8. Retired; 9. Self-employed | 13 (2.5%); 220 (41.7%); 31 (5.9%); 8 (1.5%); 37 (7.0%); 1 (0.2%); 20 (3.8%); 165 (31.3%); 32 (6.1%) | 0 (0.0%) |
| ppemploy [character] | 1. Not working; 2. Working full-time; 3. Working part-time | 221 (41.9%); 245 (46.5%); 61 (11.6%) | 0 (0.0%) |
| Q1_a [character] | 1. Approve; 2. Disapprove; 3. Skipped | 329 (62.4%); 193 (36.6%); 5 (0.9%) | 0 (0.0%) |
| Q1_b [character] | 1. Approve; 2. Disapprove; 3. Skipped | 192 (36.4%); 322 (61.1%); 13 (2.5%) | 0 (0.0%) |
| Q1_c [character] | 1. Approve; 2. Disapprove; 3. Skipped | 272 (51.6%); 248 (47.1%); 7 (1.3%) | 0 (0.0%) |
| Q1_d [character] | 1. Approve; 2. Disapprove; 3. Skipped | 192 (36.4%); 321 (60.9%); 14 (2.7%) | 0 (0.0%) |
| Q1_e [character] | 1. Approve; 2. Disapprove; 3. Skipped | 212 (40.2%); 301 (57.1%); 14 (2.7%) | 0 (0.0%) |
| Q1_f [character] | 1. Approve; 2. Disapprove; 3. Skipped | 281 (53.3%); 230 (43.6%); 16 (3.0%) | 0 (0.0%) |
| Q2 [character] | 1. Not concerned at all; 2. Not so concerned; 3. Somewhat concerned; 4. Very concerned | 65 (12.3%); 147 (27.9%); 221 (41.9%); 94 (17.8%) | 0 (0.0%) |
| Q3 [character] | 1. No; 2. Skipped; 3. Yes | 107 (20.3%); 5 (0.9%); 415 (78.7%) | 0 (0.0%) |
| Q4 [character] | 1. Excellent; 2. Good; 3. Not so good; 4. Poor; 5. Skipped | 60 (11.4%); 215 (40.8%); 97 (18.4%); 149 (28.3%); 6 (1.1%) | 0 (0.0%) |
| Q5 [character] | 1. Optimistic; 2. Pessimistic; 3. Skipped | 229 (43.5%); 295 (56.0%); 3 (0.6%) | 0 (0.0%) |
| QPID [character] | 1. A Democrat; 2. A Republican; 3. An Independent; 4. Skipped; 5. Something else | 176 (33.4%); 152 (28.8%); 168 (31.9%); 3 (0.6%); 28 (5.3%) | 0 (0.0%) |
| ABCAGE [character] | 1. 18-29; 2. 30-49; 3. 50-64; 4. 65+ | 60 (11.4%); 148 (28.1%); 157 (29.8%); 162 (30.7%) | 0 (0.0%) |
| Contact [character] | 1. No, I am not willing to b; 2. Yes, I am willing to be i | 355 (67.4%); 172 (32.6%) | 0 (0.0%) |
| weights_pid [numeric] | Mean (sd): 1 (0.6); min ≤ med ≤ max: 0.3 ≤ 0.8 ≤ 6.3; IQR (CV): 0.5 (0.6) | 453 distinct values | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

The dataset appears to come from a political survey. It has 527 rows and 31 columns. Each row (observation) holds information about one survey respondent along with their answers to the survey questions. Three of the 31 variables are ‘numeric’: id, ppage and weights_pid. All of the rest are ‘character’ variables. Most of these character variables would be better coded as ‘factor’ variables, because they are categorical variables with only a few possible values.
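
The claim that most character columns are really categorical can be checked by counting distinct values per column. The sketch below uses a small made-up tibble (`toy`, a hypothetical stand-in, not the real `abc.poll`) and then converts every character column to a factor in one step. Note that `as.factor()` orders levels alphabetically, which is not always the order you want.

```r
library(dplyr)

# Hypothetical toy stand-in for abc.poll (the real data has 527 rows).
toy <- tibble(
  id       = c(1, 2, 3, 4),
  ppgender = c("Female", "Male", "Male", "Female"),
  Q5       = c("Optimistic", "Pessimistic", "Skipped", "Optimistic")
)

# Count distinct values in every character column; a small count
# relative to the number of rows suggests a categorical variable.
distinct_counts <- toy %>%
  summarise(across(where(is.character), n_distinct))

# Convert all character columns to factors at once
# (levels default to alphabetical order).
toy_fct <- toy %>%
  mutate(across(where(is.character), as.factor))

distinct_counts
```

Applied to the real data, the same `summarise()` call would be one way to decide which columns are worth converting.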

All variable names that start with “Q” are survey questions.

Code
colnames(select(abc.poll, starts_with("Q")))
 [1] "Q1_a" "Q1_b" "Q1_c" "Q1_d" "Q1_e" "Q1_f" "Q2"   "Q3"   "Q4"   "Q5"  
[11] "QPID"
Code
length(colnames(select(abc.poll, starts_with("Q"))))
[1] 11

So, there are 11 survey questions.

All variable names that start with “pp” contain demographic information about respondents. Because starts_with() ignores case by default, PPWORKA is matched as well.

Code
colnames(select(abc.poll, starts_with("pp")))
 [1] "ppage"    "ppeduc5"  "ppeducat" "ppgender" "ppethm"   "pphhsize"
 [7] "ppinc7"   "ppmarit5" "ppmsacat" "ppreg4"   "pprent"   "ppstaten"
[13] "PPWORKA"  "ppemploy"
Code
length(colnames(select(abc.poll, starts_with("pp"))))
[1] 14

So, there are 14 variables describing respondents’ demographic characteristics. The variable complete_status can be dropped from the dataset because every observation has the same value, ‘qualified’.

Code
abc.poll <- abc.poll %>% select(-complete_status)
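
Rather than spotting single-valued columns by eye, they can be found programmatically. This is a minimal sketch on a hypothetical toy tibble (`toy`, `constant_cols` and `toy_clean` are illustrative names, not part of the original analysis).

```r
library(dplyr)

# Hypothetical toy data with one constant column.
toy <- tibble(
  id              = 1:3,
  complete_status = c("qualified", "qualified", "qualified"),
  ppgender        = c("Female", "Male", "Male")
)

# Names of the columns that hold a single distinct value.
constant_cols <- toy %>%
  select(where(~ n_distinct(.x) == 1)) %>%
  colnames()

# Drop them; all_of() raises an error if a listed name is missing.
toy_clean <- toy %>% select(-all_of(constant_cols))

constant_cols
colnames(toy_clean)
```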

The Data Frame Summary table above shows no missing values in the dataset; however, two values of the ppeduc5 variable could not be displayed and are listed as NA.

Code
table(abc.poll$ppeduc5)

                                     Master\x92s degree or above 
                                                              99 
                                            Bachelor\x92s degree 
                                                             108 
High school graduate (high school diploma or the equivalent GED) 
                                                             133 
                                   No high school diploma or GED 
                                                              29 
                                Some college or Associate degree 
                                                             158 

As seen above, these two string values were not read properly: the apostrophe shows up as the byte \x92, which is the right single quotation mark in the Windows-1252 encoding. The correct values should be “Bachelor’s degree” and “Master’s degree or above”. Let me fix it.

Code
abc.poll$ppeduc5[startsWith(abc.poll$ppeduc5, "Bac")] <- "Bachelor's degree"
abc.poll$ppeduc5[startsWith(abc.poll$ppeduc5, "Mas")] <- "Master's degree or above"

table(abc.poll$ppeduc5)

                                               Bachelor's degree 
                                                             108 
High school graduate (high school diploma or the equivalent GED) 
                                                             133 
                                        Master's degree or above 
                                                              99 
                                   No high school diploma or GED 
                                                              29 
                                Some college or Associate degree 
                                                             158 
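
An alternative to patching the strings after the fact is to declare the file encoding when reading. The sketch below assumes the source file is Windows-1252 encoded, which the \x92 byte suggests but does not prove. It writes a tiny hypothetical CSV to a temporary file and reads it back with `locale(encoding = ...)`.

```r
library(readr)

# Hypothetical toy file standing in for the poll CSV: a single column
# whose value contains the raw byte 0x92 in place of an apostrophe.
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("ppeduc5\nBachelor"), as.raw(0x92), charToRaw("s degree\n")),
  tmp
)

# Declare the assumed source encoding so readr converts the text
# to UTF-8 while reading.
toy <- read_csv(
  tmp,
  locale = locale(encoding = "windows-1252"),
  show_col_types = FALSE
)

toy$ppeduc5
```

With this approach the value comes back with a typographic apostrophe (U+2019) rather than the straight one used in the manual fix, so any factor levels defined later would need to use the same character.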

It would also be better to redefine the ppeduc5 variable so that its 5 values appear in ascending order, from “No high school diploma or GED” to “Master’s degree or above”. To do that, I will change the class of ppeduc5 from character to factor.

Code
abc.poll <- abc.poll %>% mutate(ppeduc5 = factor(ppeduc5, 
                       levels=c("No high school diploma or GED",
                                "High school graduate (high school diploma or the equivalent GED)",
                                "Some college or Associate degree",
                                "Bachelor's degree",
                                "Master's degree or above")))
                                
class(abc.poll$ppeduc5)
[1] "factor"
Code
table(abc.poll$ppeduc5)

                                   No high school diploma or GED 
                                                              29 
High school graduate (high school diploma or the equivalent GED) 
                                                             133 
                                Some college or Associate degree 
                                                             158 
                                               Bachelor's degree 
                                                             108 
                                        Master's degree or above 
                                                              99 
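
One pitfall with `factor(..., levels = ...)` is that any value not listed in `levels` silently becomes NA. The sketch below shows the effect with a deliberately misspelled level and a simple guard using `setdiff()`. The data and the typo are made up for illustration.

```r
# Hypothetical values mirroring the ppemploy categories.
employ <- c("Not working", "Working part-time", "Working full-time")

# Deliberate typo in the second level: the hyphen is missing.
lvls <- c("Not working", "Working part time", "Working full-time")

# The value that matches no level is converted to NA without a warning.
bad <- factor(employ, levels = lvls)
n_lost <- sum(is.na(bad))

# Guard: list the values present in the data but absent from the levels
# before converting.
unmatched <- setdiff(unique(employ), lvls)

n_lost
unmatched
```

Comparing `table()` counts before and after the conversion, as done above for ppeduc5, serves the same purpose.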

We can make the same class change for other variables whose values have a natural order: ppeducat, ppinc7 and ppemploy.

Code
unique(abc.poll$ppeducat)
[1] "High school"                "Bachelors degree or higher"
[3] "Some college"               "Less than high school"     
Code
unique(abc.poll$ppinc7)
[1] "$25,000 to $49,999"   "$150,000 or more"     "$100,000 to $149,999"
[4] "$10,000 to $24,999"   "$75,000 to $99,999"   "$50,000 to $74,999"  
[7] "Less than $10,000"   
Code
unique(abc.poll$ppemploy)
[1] "Not working"       "Working part-time" "Working full-time"
Code
abc.poll <- abc.poll %>% mutate(ppeducat = factor(ppeducat, 
                       levels=c("Less than high school",
                                "High school",
                                "Some college",
                                "Bachelors degree or higher")))
                                

abc.poll <- abc.poll %>% mutate(ppinc7 = factor(ppinc7, 
                       levels=c("Less than $10,000",
                                "$10,000 to $24,999",
                                "$25,000 to $49,999",
                                "$50,000 to $74,999",
                                "$75,000 to $99,999",
                                "$100,000 to $149,999",
                                "$150,000 or more")))


abc.poll <- abc.poll %>% mutate(ppemploy = factor(ppemploy, 
                       levels=c("Not working",
                                "Working part-time",
                                "Working full-time")))

On the other hand, some values of the pprent and Contact variables are unnecessarily long strings. They can be shortened to keep further analysis tidy.

Code
unique(abc.poll$pprent)
[1] "Owned or being bought by you or someone in your household"
[2] "Occupied without payment of cash rent"                    
[3] "Rented for cash"                                          
Code
unique(abc.poll$Contact)
[1] "No, I am not willing to be interviewed"
[2] "Yes, I am willing to be interviewed"   
Code
abc.poll$pprent[startsWith(abc.poll$pprent, "Owned")] <- "Owned by one of the household"
abc.poll$Contact[startsWith(abc.poll$Contact, "Yes")] <- "Yes"
abc.poll$Contact[startsWith(abc.poll$Contact, "No")] <- "No"

abc.poll <- rename(abc.poll, willingness_to_contact= Contact)

sample_n(abc.poll, 10)
# A tibble: 10 × 30
        id xspanish ppage ppeduc5  ppedu…¹ ppgen…² ppethm pphhs…³ ppinc7 ppmar…⁴
     <dbl> <chr>    <dbl> <fct>    <fct>   <chr>   <chr>  <chr>   <fct>  <chr>  
 1 7230420 English     66 Some co… Some c… Male    White… 2       $25,0… Divorc…
 2 7230428 English     35 Some co… Some c… Male    2+ Ra… 2       $50,0… Now Ma…
 3 7230179 English     66 Some co… Some c… Male    White… 2       $150,… Now Ma…
 4 7230520 English     42 Some co… Some c… Female  White… 4       $100,… Now Ma…
 5 7230507 English     68 High sc… High s… Female  Black… 2       Less … Never …
 6 7230394 English     59 Bachelo… Bachel… Female  White… 2       $75,0… Now Ma…
 7 7230337 English     38 Master'… Bachel… Male    White… 5       $150,… Now Ma…
 8 7230274 English     74 Some co… Some c… Female  White… 2       $100,… Now Ma…
 9 7230242 English     58 Master'… Bachel… Female  White… 3       $150,… Now Ma…
10 7230167 English     48 Some co… Some c… Female  White… 4       $25,0… Now Ma…
# … with 20 more variables: ppmsacat <chr>, ppreg4 <chr>, pprent <chr>,
#   ppstaten <chr>, PPWORKA <chr>, ppemploy <fct>, Q1_a <chr>, Q1_b <chr>,
#   Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>, Q2 <chr>, Q3 <chr>,
#   Q4 <chr>, Q5 <chr>, QPID <chr>, ABCAGE <chr>, willingness_to_contact <chr>,
#   weights_pid <dbl>, and abbreviated variable names ¹​ppeducat, ²​ppgender,
#   ³​pphhsize, ⁴​ppmarit5
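
The same shortening can be written as a single `mutate()` with `case_when()`, which keeps the recoding rules in one place and leaves unmatched values untouched. This is a sketch on a hypothetical toy tibble, not a replacement for the code above.

```r
library(dplyr)

# Hypothetical toy data mirroring the two Contact values.
toy <- tibble(
  Contact = c("No, I am not willing to be interviewed",
              "Yes, I am willing to be interviewed")
)

# Recode by prefix; the final TRUE branch keeps any value
# that matches neither rule.
toy <- toy %>%
  mutate(Contact = case_when(
    startsWith(Contact, "Yes") ~ "Yes",
    startsWith(Contact, "No")  ~ "No",
    TRUE                       ~ Contact
  ))

toy$Contact
```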
Code
print(dfSummary(abc.poll, 
                varnumbers= FALSE, 
                plain.ascii= FALSE, 
                style= "grid", 
                graph.magnif= 0.80, 
                valid.col= FALSE),
      method= 'render', 
      table.classes= 'table-condensed')

Data Frame Summary

abc.poll

Dimensions: 527 x 30
Duplicates: 0
| Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|
| id [numeric] | Mean (sd): 7230264 (152.3); min ≤ med ≤ max: 7230001 ≤ 7230264 ≤ 7230527; IQR (CV): 263 (0) | 527 distinct values | 0 (0.0%) |
| xspanish [character] | 1. English; 2. Spanish | 514 (97.5%); 13 (2.5%) | 0 (0.0%) |
| ppage [numeric] | Mean (sd): 53.4 (17.1); min ≤ med ≤ max: 18 ≤ 55 ≤ 91; IQR (CV): 27 (0.3) | 72 distinct values | 0 (0.0%) |
| ppeduc5 [factor] | 1. No high school diploma or; 2. High school graduate (hig; 3. Some college or Associate; 4. Bachelor's degree; 5. Master's degree or above | 29 (5.5%); 133 (25.2%); 158 (30.0%); 108 (20.5%); 99 (18.8%) | 0 (0.0%) |
| ppeducat [factor] | 1. Less than high school; 2. High school; 3. Some college; 4. Bachelors degree or highe | 29 (5.5%); 133 (25.2%); 158 (30.0%); 207 (39.3%) | 0 (0.0%) |
| ppgender [character] | 1. Female; 2. Male | 254 (48.2%); 273 (51.8%) | 0 (0.0%) |
| ppethm [character] | 1. 2+ Races, Non-Hispanic; 2. Black, Non-Hispanic; 3. Hispanic; 4. Other, Non-Hispanic; 5. White, Non-Hispanic | 21 (4.0%); 27 (5.1%); 51 (9.7%); 24 (4.6%); 404 (76.7%) | 0 (0.0%) |
| pphhsize [character] | 1. 1; 2. 2; 3. 3; 4. 4; 5. 5; 6. 6 or more | 80 (15.2%); 219 (41.6%); 102 (19.4%); 76 (14.4%); 35 (6.6%); 15 (2.8%) | 0 (0.0%) |
| ppinc7 [factor] | 1. Less than $10,000; 2. $10,000 to $24,999; 3. $25,000 to $49,999; 4. $50,000 to $74,999; 5. $75,000 to $99,999; 6. $100,000 to $149,999; 7. $150,000 or more | 17 (3.2%); 32 (6.1%); 82 (15.6%); 85 (16.1%); 69 (13.1%); 105 (19.9%); 137 (26.0%) | 0 (0.0%) |
| ppmarit5 [character] | 1. Divorced; 2. Never married; 3. Now Married; 4. Separated; 5. Widowed | 43 (8.2%); 111 (21.1%); 337 (63.9%); 8 (1.5%); 28 (5.3%) | 0 (0.0%) |
| ppmsacat [character] | 1. Metro area; 2. Non-metro area | 448 (85.0%); 79 (15.0%) | 0 (0.0%) |
| ppreg4 [character] | 1. MidWest; 2. NorthEast; 3. South; 4. West | 118 (22.4%); 93 (17.6%); 190 (36.1%); 126 (23.9%) | 0 (0.0%) |
| pprent [character] | 1. Occupied without payment; 2. Owned by one of the house; 3. Rented for cash | 10 (1.9%); 406 (77.0%); 111 (21.1%) | 0 (0.0%) |
| ppstaten [character] | 1. California; 2. Texas; 3. Florida; 4. Pennsylvania; 5. Illinois; 6. New Jersey; 7. Ohio; 8. Michigan; 9. New York; 10. Washington; [ 39 others ] | 51 (9.7%); 42 (8.0%); 34 (6.5%); 28 (5.3%); 23 (4.4%); 21 (4.0%); 21 (4.0%); 18 (3.4%); 18 (3.4%); 18 (3.4%); 253 (48.0%) | 0 (0.0%) |
| PPWORKA [character] | 1. Currently laid off; 2. Employed full-time (by so; 3. Employed part-time (by so; 4. Full Time Student; 5. Homemaker; 6. On furlough; 7. Other; 8. Retired; 9. Self-employed | 13 (2.5%); 220 (41.7%); 31 (5.9%); 8 (1.5%); 37 (7.0%); 1 (0.2%); 20 (3.8%); 165 (31.3%); 32 (6.1%) | 0 (0.0%) |
| ppemploy [factor] | 1. Not working; 2. Working part-time; 3. Working full-time | 221 (41.9%); 61 (11.6%); 245 (46.5%) | 0 (0.0%) |
| Q1_a [character] | 1. Approve; 2. Disapprove; 3. Skipped | 329 (62.4%); 193 (36.6%); 5 (0.9%) | 0 (0.0%) |
| Q1_b [character] | 1. Approve; 2. Disapprove; 3. Skipped | 192 (36.4%); 322 (61.1%); 13 (2.5%) | 0 (0.0%) |
| Q1_c [character] | 1. Approve; 2. Disapprove; 3. Skipped | 272 (51.6%); 248 (47.1%); 7 (1.3%) | 0 (0.0%) |
| Q1_d [character] | 1. Approve; 2. Disapprove; 3. Skipped | 192 (36.4%); 321 (60.9%); 14 (2.7%) | 0 (0.0%) |
| Q1_e [character] | 1. Approve; 2. Disapprove; 3. Skipped | 212 (40.2%); 301 (57.1%); 14 (2.7%) | 0 (0.0%) |
| Q1_f [character] | 1. Approve; 2. Disapprove; 3. Skipped | 281 (53.3%); 230 (43.6%); 16 (3.0%) | 0 (0.0%) |
| Q2 [character] | 1. Not concerned at all; 2. Not so concerned; 3. Somewhat concerned; 4. Very concerned | 65 (12.3%); 147 (27.9%); 221 (41.9%); 94 (17.8%) | 0 (0.0%) |
| Q3 [character] | 1. No; 2. Skipped; 3. Yes | 107 (20.3%); 5 (0.9%); 415 (78.7%) | 0 (0.0%) |
| Q4 [character] | 1. Excellent; 2. Good; 3. Not so good; 4. Poor; 5. Skipped | 60 (11.4%); 215 (40.8%); 97 (18.4%); 149 (28.3%); 6 (1.1%) | 0 (0.0%) |
| Q5 [character] | 1. Optimistic; 2. Pessimistic; 3. Skipped | 229 (43.5%); 295 (56.0%); 3 (0.6%) | 0 (0.0%) |
| QPID [character] | 1. A Democrat; 2. A Republican; 3. An Independent; 4. Skipped; 5. Something else | 176 (33.4%); 152 (28.8%); 168 (31.9%); 3 (0.6%); 28 (5.3%) | 0 (0.0%) |
| ABCAGE [character] | 1. 18-29; 2. 30-49; 3. 50-64; 4. 65+ | 60 (11.4%); 148 (28.1%); 157 (29.8%); 162 (30.7%) | 0 (0.0%) |
| willingness_to_contact [character] | 1. No; 2. Yes | 355 (67.4%); 172 (32.6%) | 0 (0.0%) |
| weights_pid [numeric] | Mean (sd): 1 (0.6); min ≤ med ≤ max: 0.3 ≤ 0.8 ≤ 6.3; IQR (CV): 0.5 (0.6) | 453 distinct values | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20
