library(tidyverse)
library(ggplot2)
library(lubridate)
library(dplyr)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 7
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- Recreate at least two graphs from previous exercises, but introduce at least one additional dimension that you omitted before using ggplot functionality (color, shape, line, facet, etc) The goal is not to create unneeded chart ink (Tufte), but to concisely capture variation in additional dimensions that were collapsed in your earlier 2 or 3 dimensional graphs.
- Explain why you choose the specific graph type
- If you haven’t tried in previous weeks, work this week to make your graphs “publication” ready with titles, captions, and pretty axis labels and other viewer-friendly features
R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code. And anyone not familiar with Edward Tufte should check out his fantastic books and courses on data visualizaton.
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- eggs ⭐
- abc_poll ⭐⭐
- australian_marriage ⭐⭐
- hotel_bookings ⭐⭐⭐
- air_bnb ⭐⭐⭐
- us_hh ⭐⭐⭐⭐
- faostat ⭐⭐⭐⭐⭐
# Read in the data
<- read.csv("_data/abc_poll_2021.csv")
df head(df)
id xspanish complete_status ppage
1 7230001 English qualified 68
2 7230002 English qualified 85
3 7230003 English qualified 69
4 7230004 English qualified 74
5 7230005 English qualified 77
6 7230006 English qualified 70
ppeduc5
1 High school graduate (high school diploma or the equivalent GED)
2 Bachelor\x92s degree
3 High school graduate (high school diploma or the equivalent GED)
4 Bachelor\x92s degree
5 High school graduate (high school diploma or the equivalent GED)
6 Bachelor\x92s degree
ppeducat ppgender ppethm pphhsize
1 High school Female White, Non-Hispanic 2
2 Bachelors degree or higher Male White, Non-Hispanic 2
3 High school Male White, Non-Hispanic 2
4 Bachelors degree or higher Female White, Non-Hispanic 1
5 High school Male White, Non-Hispanic 3
6 Bachelors degree or higher Male White, Non-Hispanic 2
ppinc7 ppmarit5 ppmsacat ppreg4
1 $25,000 to $49,999 Now Married Metro area South
2 $150,000 or more Now Married Metro area South
3 $100,000 to $149,999 Now Married Metro area South
4 $25,000 to $49,999 Divorced Metro area NorthEast
5 $10,000 to $24,999 Now Married Metro area MidWest
6 $75,000 to $99,999 Now Married Metro area MidWest
pprent ppstaten
1 Owned or being bought by you or someone in your household Florida
2 Owned or being bought by you or someone in your household Kentucky
3 Owned or being bought by you or someone in your household Florida
4 Owned or being bought by you or someone in your household Pennsylvania
5 Owned or being bought by you or someone in your household Michigan
6 Owned or being bought by you or someone in your household Missouri
PPWORKA ppemploy Q1_a Q1_b
1 Retired Not working Approve Approve
2 Retired Not working Approve Approve
3 Retired Not working Disapprove Disapprove
4 Retired Not working Approve Approve
5 Retired Not working Approve Approve
6 Employed part-time (by someone else) Working part-time Approve Approve
Q1_c Q1_d Q1_e Q1_f Q2 Q3 Q4
1 Disapprove Approve Approve Disapprove Not so concerned Yes Good
2 Approve Approve Approve Approve Somewhat concerned Yes Good
3 Disapprove Disapprove Disapprove Approve Very concerned Yes Poor
4 Approve Approve Approve Approve Somewhat concerned Yes Excellent
5 Approve Approve Approve Approve Somewhat concerned Yes Excellent
6 Approve Approve Disapprove Approve Not so concerned Yes Excellent
Q5 QPID ABCAGE Contact
1 Optimistic A Democrat 65+ No, I am not willing to be interviewed
2 Optimistic An Independent 65+ No, I am not willing to be interviewed
3 Pessimistic Something else 65+ No, I am not willing to be interviewed
4 Pessimistic An Independent 65+ Yes, I am willing to be interviewed
5 Optimistic A Democrat 65+ No, I am not willing to be interviewed
6 Optimistic A Democrat 65+ No, I am not willing to be interviewed
weights_pid
1 0.6382
2 0.5493
3 0.8488
4 0.8126
5 0.4994
6 0.4043
Briefly describe the data
The dataset consists of 527 observations, each representing an individual. The variables in the dataset include demographic information such as age, gender, ethnicity, education level, household size, income range, marital status, and region of residence. The dataset also includes variables related to work status, employment, and political affiliation. Additionally, there are multiple-choice questions and responses, along with measures of concern, self-perceived quality of life, and optimism/pessimism about the future. The dataset also contains information about the individual’s willingness to be interviewed and weights assigned to each observation. The dataset provides a comprehensive overview of various demographic and socio-political factors for a sample of individuals. It offers valuable insights into the characteristics and perspectives of the respondents, making it suitable for analyzing relationships and patterns among the variables and drawing conclusions about the population they represent.
dim(df)
[1] 527 31
summary(df)
id xspanish complete_status ppage
Min. :7230001 Length:527 Length:527 Min. :18.00
1st Qu.:7230132 Class :character Class :character 1st Qu.:40.00
Median :7230264 Mode :character Mode :character Median :55.00
Mean :7230264 Mean :53.39
3rd Qu.:7230396 3rd Qu.:67.00
Max. :7230527 Max. :91.00
ppeduc5 ppeducat ppgender ppethm
Length:527 Length:527 Length:527 Length:527
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
pphhsize ppinc7 ppmarit5 ppmsacat
Length:527 Length:527 Length:527 Length:527
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
ppreg4 pprent ppstaten PPWORKA
Length:527 Length:527 Length:527 Length:527
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
ppemploy Q1_a Q1_b Q1_c
Length:527 Length:527 Length:527 Length:527
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Q1_d Q1_e Q1_f Q2
Length:527 Length:527 Length:527 Length:527
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Q3 Q4 Q5 QPID
Length:527 Length:527 Length:527 Length:527
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
ABCAGE Contact weights_pid
Length:527 Length:527 Min. :0.3240
Class :character Class :character 1st Qu.:0.6332
Mode :character Mode :character Median :0.8451
Mean :1.0000
3rd Qu.:1.1516
Max. :6.2553
# Get unique values for the ppstaten column
<- unique(df$ppstaten)
unique_ppstaten cat(paste(unique_ppstaten, collapse = ", "))
Florida, Kentucky, Pennsylvania, Michigan, Missouri, New Jersey, Georgia, Washington, New Hampshire, Maryland, California, Arizona, Texas, North Dakota, Connecticut, Massachusetts, Oregon, Illinois, Arkansas, South Carolina, Tennessee, Hawaii, Kansas, Colorado, Louisiana, Virginia, Wisconsin, Vermont, Ohio, North Carolina, New York, Nevada, Nebraska, Idaho, Alabama, Minnesota, Montana, Utah, New Mexico, Oklahoma, Indiana, South Dakota, Mississippi, Delaware, District of Columbia, West Virginia, Wyoming, Maine, Iowa
str(df)
'data.frame': 527 obs. of 31 variables:
$ id : int 7230001 7230002 7230003 7230004 7230005 7230006 7230007 7230008 7230009 7230010 ...
$ xspanish : chr "English" "English" "English" "English" ...
$ complete_status: chr "qualified" "qualified" "qualified" "qualified" ...
$ ppage : int 68 85 69 74 77 70 26 76 78 47 ...
$ ppeduc5 : chr "High school graduate (high school diploma or the equivalent GED)" "Bachelor\x92s degree" "High school graduate (high school diploma or the equivalent GED)" "Bachelor\x92s degree" ...
$ ppeducat : chr "High school" "Bachelors degree or higher" "High school" "Bachelors degree or higher" ...
$ ppgender : chr "Female" "Male" "Male" "Female" ...
$ ppethm : chr "White, Non-Hispanic" "White, Non-Hispanic" "White, Non-Hispanic" "White, Non-Hispanic" ...
$ pphhsize : chr "2" "2" "2" "1" ...
$ ppinc7 : chr "$25,000 to $49,999" "$150,000 or more" "$100,000 to $149,999" "$25,000 to $49,999" ...
$ ppmarit5 : chr "Now Married" "Now Married" "Now Married" "Divorced" ...
$ ppmsacat : chr "Metro area" "Metro area" "Metro area" "Metro area" ...
$ ppreg4 : chr "South" "South" "South" "NorthEast" ...
$ pprent : chr "Owned or being bought by you or someone in your household" "Owned or being bought by you or someone in your household" "Owned or being bought by you or someone in your household" "Owned or being bought by you or someone in your household" ...
$ ppstaten : chr "Florida" "Kentucky" "Florida" "Pennsylvania" ...
$ PPWORKA : chr "Retired" "Retired" "Retired" "Retired" ...
$ ppemploy : chr "Not working" "Not working" "Not working" "Not working" ...
$ Q1_a : chr "Approve" "Approve" "Disapprove" "Approve" ...
$ Q1_b : chr "Approve" "Approve" "Disapprove" "Approve" ...
$ Q1_c : chr "Disapprove" "Approve" "Disapprove" "Approve" ...
$ Q1_d : chr "Approve" "Approve" "Disapprove" "Approve" ...
$ Q1_e : chr "Approve" "Approve" "Disapprove" "Approve" ...
$ Q1_f : chr "Disapprove" "Approve" "Approve" "Approve" ...
$ Q2 : chr "Not so concerned" "Somewhat concerned" "Very concerned" "Somewhat concerned" ...
$ Q3 : chr "Yes" "Yes" "Yes" "Yes" ...
$ Q4 : chr "Good" "Good" "Poor" "Excellent" ...
$ Q5 : chr "Optimistic" "Optimistic" "Pessimistic" "Pessimistic" ...
$ QPID : chr "A Democrat" "An Independent" "Something else" "An Independent" ...
$ ABCAGE : chr "65+" "65+" "65+" "65+" ...
$ Contact : chr "No, I am not willing to be interviewed" "No, I am not willing to be interviewed" "No, I am not willing to be interviewed" "Yes, I am willing to be interviewed" ...
$ weights_pid : num 0.638 0.549 0.849 0.813 0.499 ...
Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
Taking the selected columns Q1_a, Q1_b, Q1_c, Q1_d, Q1_e, and Q1_f and transforming them into two new columns: Question and Response. The Question column contains the names of the original columns, and the Response column contains the corresponding values. It becomes easier to perform operations such as grouping, filtering, and plotting.
<- df %>%
tidy_data select(ppgender, ppethm, ppmarit5, ppinc7, ppstaten, Q1_a, Q1_b, Q1_c, Q1_d, Q1_e, Q1_f, Q2, Q3, Q4, Q5) %>%
pivot_longer(cols = c(Q1_a, Q1_b, Q1_c, Q1_d, Q1_e, Q1_f),
names_to = "Question",
values_to = "Response") %>%
filter(!is.na(Response))
head(tidy_data %>% select(Question, Response))
# A tibble: 6 × 2
Question Response
<chr> <chr>
1 Q1_a Approve
2 Q1_b Approve
3 Q1_c Disapprove
4 Q1_d Approve
5 Q1_e Approve
6 Q1_f Disapprove
Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Creating a new column called “ppage_category” the cut() function. This new column categorizes the values in the “ppage” column into different age groups based on specified intervals. The resulting “ppage_category” column provides a categorical representation of the age groups: “Young”, “Middle-aged”, “Senior”, and “Unknown”.
<- df[, c("id", "ppage", "ppeducat", "ppgender", "ppethm", "pphhsize", "ppinc7")]
selected_data
$ppage_category <- cut(selected_data$ppage,
selected_databreaks = c(0, 30, 50, 70, Inf),
labels = c("Young", "Middle-aged", "Senior", "Unknown"))
Document your work here.
Visualization with Multiple Dimensions
creating two visualizations. The first is a bar chart showing the count of responses for each question. The second is a stacked bar chart displaying the count of responses for each question, grouped by gender and ethnicity.These visualizations allow us to explore the distribution of responses across different questions and examine how it varies by gender and ethnicity.
# Create a bar chart of count by Question and Response
ggplot(tidy_data, aes(x = Question, fill = Response)) +
geom_bar() +
labs(title = "Count by Question and Response",
x = "Question",
y = "Count",
fill = "Response") +
theme_bw()
# Create a stacked bar chart of count by Question response Approve, grouped by Gender and Ethnicity
ggplot(tidy_data %>%filter(Response == "Approve"), aes(x = Question, fill = ppethm)) +
geom_bar(position = "fill") +
facet_wrap(~ppgender) +
labs(title = "Count by Question, Gender, and Ethnicity",
x = "Question",
y = "Proportion",
fill = "Ethnicity") +
theme_bw()