library(tidyverse)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 6
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- create at least one graph including time (evolution)
  - try to make them “publication” ready (optional)
  - explain why you chose the specific graph type
- create at least one graph depicting part-whole or flow relationships
  - try to make them “publication” ready (optional)
  - explain why you chose the specific graph type
R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- debt ⭐
- fed_rate ⭐⭐
- abc_poll ⭐⭐⭐
- usa_hh ⭐⭐⭐
- hotel_bookings ⭐⭐⭐⭐
- AB_NYC ⭐⭐⭐⭐⭐
setwd("C:/Github Projects/601_Spring_2023/posts/_data")
poll <- read.csv("abc_poll_2021.csv")
head(poll)
View(poll)
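Hard-coding setwd() ties the post to one machine. Below is a minimal sketch of a more portable read that builds the path with file.path() instead. The folder `data_dir`, the object `poll_demo`, and the two-row stand-in file are all made up for illustration; only the file name and column names come from this post.

```r
# Hypothetical stand-in for the real file so the sketch runs anywhere:
# write a two-row CSV with a few of the poll's column names to a temp folder.
data_dir <- tempfile("data_")
dir.create(data_dir)
write.csv(
  data.frame(id = c(1, 2),
             ppeducat = c("High school", "Some college"),
             QPID = c("A Democrat", "A Republican")),
  file.path(data_dir, "abc_poll_2021.csv"),
  row.names = FALSE
)

# Build the path explicitly rather than changing the working directory.
poll_demo <- read.csv(file.path(data_dir, "abc_poll_2021.csv"))

# Sanity checks: dimensions and column names.
dim(poll_demo)
names(poll_demo)
```

With the real data, `data_dir` would point at the `_data` folder and the same two checks would confirm the row and column counts before any tidying starts.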
Briefly describe the data
Tidy Data (as needed)
First, I will rename all the columns so they are actually usable. Some of them, such as ppmsacat, aren’t impossible to identify, but a layperson would have no idea what they mean. At first glance, this data set has 31 variables and 527 observations. I would assume those behind the survey seek to identify trends across political parties (e.g., do Republicans have less education? Do Democrats have higher income? Do Independents live in the suburbs with big families or in cities alone?).
Poll <- poll %>%
  rename(ID = 'id', Degree = 'ppeduc5', Primary_Language = 'xspanish',
         Page = 'ppage', Education_Level = 'ppeducat', Gender = 'ppgender',
         Household_Size = 'pphhsize', Ethnicity = 'ppethm', Income = 'ppinc7',
         Marital_Status = 'ppmarit5', Metro_Stat_Area = 'ppmsacat',
         Region = 'ppreg4', Rental_Status = 'pprent', State = 'ppstaten',
         Retired = 'PPWORKA', Employment_Status = 'ppemploy',
         Complete_Status = 'complete_status', Political_Affiliation = 'QPID',
         Age = 'ABCAGE', Interview = 'Contact')
View(Poll)
Next, I will remove some columns that won’t be helpful for this analysis.
Poll$Page <- NULL
Poll$ID <- NULL
Poll$Complete_Status <- NULL
Poll$weights_pid <- NULL
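The same columns could also be dropped in a single dplyr step. This is only a sketch on a made-up toy data frame (`toy` and its values are hypothetical; the column names are the renamed ones from above):

```r
library(dplyr)

# Hypothetical toy data frame using the renamed column names.
toy <- tibble(
  ID = 1:3,
  Page = c(34, 51, 27),
  Complete_Status = "qualified",
  weights_pid = c(0.9, 1.1, 1.0),
  Education_Level = c("High school", "Some college", "High school")
)

# Drop all unwanted columns at once; any_of() skips names that are absent,
# so re-running the chunk does not raise an error.
toy_small <- toy %>%
  select(-any_of(c("Page", "ID", "Complete_Status", "weights_pid")))

names(toy_small)
```

A quick look at names() afterwards doubles as a sanity check that only the intended columns were removed.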
# My analysis will look at the working class, so I will filter out any retired individuals
Poll_re <- Poll %>%
  select(Retired, Employment_Status, Education_Level, Income, Household_Size,
         Ethnicity, Metro_Stat_Area, Region, Political_Affiliation, Age) %>%
  filter(Retired != 'Retired')
View(Poll_re)
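Since the challenge asks for sanity checks, here is a small sketch of how the filter above could be verified. The toy rows and the employment labels in them are made up for illustration and are not taken from the poll.

```r
library(dplyr)

# Hypothetical toy data; the labels are illustrative stand-ins.
toy <- tibble(
  Retired = c("Retired", "Employed full-time", "Currently laid off", "Retired"),
  Age = c(70, 35, 42, 68)
)

toy_re <- toy %>%
  filter(Retired != "Retired")

# Check 1: no retired respondents remain after the filter.
stopifnot(!any(toy_re$Retired == "Retired"))

# Check 2: how many rows the filter removed.
rows_dropped <- nrow(toy) - nrow(toy_re)
rows_dropped
```

Note that filter() also drops rows where the condition is NA, so a missing employment value would be removed along with the retirees.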
I want to look at the role education and employment status might play in political affiliation. My hypothesis is that non-retired, unemployed individuals are less educated; I will explore this below and see where their political affiliation falls (Democrat vs. Republican only).
Democrat_Income <- Poll_re[order(Poll_re$Political_Affiliation), ] %>%
  select(Income, Political_Affiliation) %>%
  filter(Political_Affiliation %in% c('A Democrat', "A Republican"))
View(Democrat_Income)
# Bar height is the number of respondents in each party, not an income figure
Democrat_Income %>%
  ggplot(aes(x = Political_Affiliation)) +
  geom_bar() +
  ggthemes::theme_economist() +  # theme_economist() comes from the ggthemes package
  labs(title = "Respondents by Political Affiliation") +
  ylab('Number of respondents') +
  xlab('Political Affiliation')
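The chart above counts respondents per party and never uses Income. One way to bring income in, and to cover the part-whole graph the challenge asks for, is a stacked bar scaled to 100%. The sketch below runs on made-up toy data; the income bracket labels are hypothetical, not the poll's actual categories.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy responses; the real values come from the poll data.
toy <- tibble(
  Political_Affiliation = c("A Democrat", "A Democrat", "A Democrat",
                            "A Republican", "A Republican", "A Republican"),
  Income = c("Less than $50,000", "$50,000 to $99,999", "$100,000 or more",
             "Less than $50,000", "Less than $50,000", "$100,000 or more")
)

# position = "fill" rescales every bar to 1, so each segment shows an
# income bracket's share within a party (a part-whole view).
p <- ggplot(toy, aes(x = Political_Affiliation, fill = Income)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Income Bracket by Political Affiliation",
       x = "Political Affiliation", y = "Share of respondents")

p
```

I would pick a filled bar here because the two parties have different numbers of respondents, and shares make the bars directly comparable.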
n <- nrow(poll)
Poll %>%
  select(Education_Level, Political_Affiliation) %>%
  filter(Education_Level == 'Less than high school') %>%
  count()/n
29 individuals (or 5.5%) have less than a high school degree.
Poll %>%
  select(Education_Level, Political_Affiliation) %>%
  filter(Education_Level == 'High school') %>%
  count()/n
133 individuals (or 25.2%) have a high school diploma. I’m interested in seeing what voting preference those with less traditional education have. Going one step further, I’ll add political affiliation to the counts above.
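As an aside, the shares for every education level can come from one pass instead of one filter per level. This is a sketch on made-up toy data (`toy` and `edu_share` are hypothetical names), assuming the same Education_Level column:

```r
library(dplyr)

# Hypothetical toy data; the real column comes from the renamed poll data.
toy <- tibble(
  Education_Level = c("Less than high school", "High school", "High school",
                      "Some college", "Bachelors degree or higher")
)

# count() tallies each level; dividing by sum(n) turns tallies into shares.
edu_share <- toy %>%
  count(Education_Level) %>%
  mutate(share = n / sum(n))

edu_share
```

Because the shares are computed from the same table, they should always sum to 1, which is an easy sanity check.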
# Less than high school: Independents
Poll %>%
  select(Education_Level, Political_Affiliation) %>%
  filter(Education_Level == 'Less than high school') %>%
  filter(Political_Affiliation == 'An Independent') %>%
  count()
# Less than high school: skipped the question
Poll %>%
  select(Education_Level, Political_Affiliation) %>%
  filter(Education_Level == 'Less than high school') %>%
  filter(Political_Affiliation == 'Skipped') %>%
  count()
# Less than high school: Democrats (uses the raw data and original column names)
poll %>%
  select(ppeducat, QPID) %>%
  filter(ppeducat == 'Less than high school') %>%
  filter(QPID == 'A Democrat') %>%
  count()
# Less than high school: Republicans
Poll %>%
  select(Education_Level, Political_Affiliation) %>%
  filter(Education_Level == 'Less than high school') %>%
  filter(Political_Affiliation == 'A Republican') %>%
  count()
It actually surprises me that, for such a small sample, the numbers are dispersed about as I would have expected. I’ll do the same thing for people with only a high school degree and see if their affiliation lines up with what I might expect.
# High school: Independents
Poll %>%
  select(Education_Level, Political_Affiliation) %>%
  filter(Education_Level == 'High school') %>%
  filter(Political_Affiliation == 'An Independent') %>%
  count()
# High school: skipped the question
Poll %>%
  select(Education_Level, Political_Affiliation) %>%
  filter(Education_Level == 'High school') %>%
  filter(Political_Affiliation == 'Skipped') %>%
  count()
# High school: Republicans
Poll %>%
  select(Education_Level, Political_Affiliation) %>%
  filter(Education_Level == 'High school') %>%
  filter(Political_Affiliation == 'A Republican') %>%
  count()
# High school: Democrats
Poll %>%
  select(Education_Level, Political_Affiliation) %>%
  filter(Education_Level == 'High school') %>%
  filter(Political_Affiliation == 'A Democrat') %>%
  count()
Again, this is honestly a shockingly clean dispersion: roughly 40 each for Independents, Republicans, and Democrats.
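The eight filter-and-count chunks above could be collapsed into a single cross-tabulation. Here is a minimal sketch on made-up toy data (`toy` and `crosstab` are hypothetical names), assuming the same two renamed columns:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the renamed poll columns.
toy <- tibble(
  Education_Level = c("High school", "High school", "High school",
                      "Less than high school", "Less than high school"),
  Political_Affiliation = c("A Democrat", "A Republican", "A Democrat",
                            "An Independent", "A Democrat")
)

# count() on both columns gives one row per combination;
# pivot_wider() spreads the affiliations into columns, with 0 for
# combinations that never occur.
crosstab <- toy %>%
  count(Education_Level, Political_Affiliation) %>%
  pivot_wider(names_from = Political_Affiliation, values_from = n,
              values_fill = 0)

crosstab
```

One table makes it easier to compare education levels side by side, and it avoids copy-paste slips such as a misspelled column name in one of the repeated chunks.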
mycols <- c("lightpink", "lightblue", "lightyellow", "lightgreen", "orange")
# barplot() needs numeric bar heights, so tabulate the affiliations first
barplot(table(Poll$Political_Affiliation), col = mycols,
        ylab = "Number of respondents")