DACSS 601: Data Science Fundamentals - FALL 2022
Challenge 6

challenge_6
hotel_bookings
air_bnb
fed_rate
debt
usa_households
abc_poll
Visualizing Time and Relationships
Author

Lai Wei

Published

November 19, 2022

library(tidyverse)
library(ggplot2)
library(readxl)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. create at least one graph including time (evolution)
     • try to make them “publication” ready (optional)
     • Explain why you choose the specific graph type
  5. Create at least one graph depicting part-whole or flow relationships
     • try to make them “publication” ready (optional)
     • Explain why you choose the specific graph type

R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.

(be sure to only include the category tags for the data you use!)

Read in data

  • abc_poll ⭐⭐⭐
rate <- read_csv("_data/abc_poll_2021.csv")
rate
# A tibble: 527 × 31
        id xspanish comple…¹ ppage ppeduc5 ppedu…² ppgen…³ ppethm pphhs…⁴ ppinc7
     <dbl> <chr>    <chr>    <dbl> <chr>   <chr>   <chr>   <chr>  <chr>   <chr> 
 1 7230001 English  qualifi…    68 "High … High s… Female  White… 2       $25,0…
 2 7230002 English  qualifi…    85 "Bache… Bachel… Male    White… 2       $150,…
 3 7230003 English  qualifi…    69 "High … High s… Male    White… 2       $100,…
 4 7230004 English  qualifi…    74 "Bache… Bachel… Female  White… 1       $25,0…
 5 7230005 English  qualifi…    77 "High … High s… Male    White… 3       $10,0…
 6 7230006 English  qualifi…    70 "Bache… Bachel… Male    White… 2       $75,0…
 7 7230007 English  qualifi…    26 "Maste… Bachel… Male    Other… 3       $150,…
 8 7230008 English  qualifi…    76 "Bache… Bachel… Male    Black… 2       $50,0…
 9 7230009 English  qualifi…    78 "Bache… Bachel… Female  White… 2       $150,…
10 7230010 English  qualifi…    47 "Maste… Bachel… Male    Other… 4       $150,…
# … with 517 more rows, 21 more variables: ppmarit5 <chr>, ppmsacat <chr>,
#   ppreg4 <chr>, pprent <chr>, ppstaten <chr>, PPWORKA <chr>, ppemploy <chr>,
#   Q1_a <chr>, Q1_b <chr>, Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>,
#   Q2 <chr>, Q3 <chr>, Q4 <chr>, Q5 <chr>, QPID <chr>, ABCAGE <chr>,
#   Contact <chr>, weights_pid <dbl>, and abbreviated variable names
#   ¹​complete_status, ²​ppeducat, ³​ppgender, ⁴​pphhsize

Briefly describe the data
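
The printout above shows 527 rows (one per respondent) and 31 columns: an id, a set of demographic fields prefixed with pp (age, education, gender, ethnicity, household size, income, and others), the survey items Q1_a through Q5, the grouping variables QPID and ABCAGE, a Contact field, and a weights_pid column. Most columns are stored as character. The sketch below shows one way such a summary could be produced in code. Because the CSV is not available here, it uses a small stand-in tibble (`toy`, a hypothetical name with invented values) that only mimics a few of the real column names.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `rate`; all values are invented for illustration.
toy <- tibble(
  id       = c(1, 2, 3, 4),
  ppage    = c(68, 85, 26, 47),
  ppgender = c("Female", "Male", "Male", "Female"),
  Q2       = c("Yes", "No", "Yes", "No")
)

# Number of rows (respondents) and columns (variables)
dims <- dim(toy)
dims

# How many columns there are of each storage type
type_counts <- table(vapply(toy, function(col) class(col)[1], character(1)))
type_counts

# Compact column-by-column overview
glimpse(toy)
```

Running the same three calls on `rate` should give the row and column counts reported in the tibble header.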

Tidy Data (as needed)

Get the column names.

colnames(rate)
 [1] "id"              "xspanish"        "complete_status" "ppage"          
 [5] "ppeduc5"         "ppeducat"        "ppgender"        "ppethm"         
 [9] "pphhsize"        "ppinc7"          "ppmarit5"        "ppmsacat"       
[13] "ppreg4"          "pprent"          "ppstaten"        "PPWORKA"        
[17] "ppemploy"        "Q1_a"            "Q1_b"            "Q1_c"           
[21] "Q1_d"            "Q1_e"            "Q1_f"            "Q2"             
[25] "Q3"              "Q4"              "Q5"              "QPID"           
[29] "ABCAGE"          "Contact"         "weights_pid"    

In this section, I build a proportion table showing the racial and ethnic composition of the survey respondents. White, Non-Hispanic respondents are by far the largest group, at about 77% of the sample.

rate_1 <- rate %>% 
  select(xspanish, starts_with("pp")) 
prop.table(table(rate_1$ppethm))

2+ Races, Non-Hispanic    Black, Non-Hispanic               Hispanic 
            0.03984820             0.05123340             0.09677419 
   Other, Non-Hispanic    White, Non-Hispanic 
            0.04554080             0.76660342 
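
The same shares can also be computed as a tidy data frame, which is easier to sort or pass to ggplot. This is a minimal sketch on a made-up `toy` tibble (hypothetical values, not the poll data); `eth_share` is likewise an illustrative name.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the ppethm column; the counts are invented.
toy <- tibble(
  ppethm = c(rep("White, Non-Hispanic", 6), rep("Hispanic", 2),
             "Black, Non-Hispanic", "Other, Non-Hispanic")
)

eth_share <- toy %>%
  count(ppethm, name = "n") %>%   # respondents per group
  mutate(prop = n / sum(n)) %>%   # share of the whole sample
  arrange(desc(prop))             # largest group first

eth_share
```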
rate_ethm <- rate %>% 
  mutate(Ethm = ifelse(ppethm == "White, Non-Hispanic", "non-ethnic minorities", "ethnic minorities"))

rate_ethm
# A tibble: 527 × 32
        id xspanish comple…¹ ppage ppeduc5 ppedu…² ppgen…³ ppethm pphhs…⁴ ppinc7
     <dbl> <chr>    <chr>    <dbl> <chr>   <chr>   <chr>   <chr>  <chr>   <chr> 
 1 7230001 English  qualifi…    68 "High … High s… Female  White… 2       $25,0…
 2 7230002 English  qualifi…    85 "Bache… Bachel… Male    White… 2       $150,…
 3 7230003 English  qualifi…    69 "High … High s… Male    White… 2       $100,…
 4 7230004 English  qualifi…    74 "Bache… Bachel… Female  White… 1       $25,0…
 5 7230005 English  qualifi…    77 "High … High s… Male    White… 3       $10,0…
 6 7230006 English  qualifi…    70 "Bache… Bachel… Male    White… 2       $75,0…
 7 7230007 English  qualifi…    26 "Maste… Bachel… Male    Other… 3       $150,…
 8 7230008 English  qualifi…    76 "Bache… Bachel… Male    Black… 2       $50,0…
 9 7230009 English  qualifi…    78 "Bache… Bachel… Female  White… 2       $150,…
10 7230010 English  qualifi…    47 "Maste… Bachel… Male    Other… 4       $150,…
# … with 517 more rows, 22 more variables: ppmarit5 <chr>, ppmsacat <chr>,
#   ppreg4 <chr>, pprent <chr>, ppstaten <chr>, PPWORKA <chr>, ppemploy <chr>,
#   Q1_a <chr>, Q1_b <chr>, Q1_c <chr>, Q1_d <chr>, Q1_e <chr>, Q1_f <chr>,
#   Q2 <chr>, Q3 <chr>, Q4 <chr>, Q5 <chr>, QPID <chr>, ABCAGE <chr>,
#   Contact <chr>, weights_pid <dbl>, Ethm <chr>, and abbreviated variable
#   names ¹​complete_status, ²​ppeducat, ³​ppgender, ⁴​pphhsize
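
The challenge asks for a sanity check after each mutate. One possible check for the Ethm recode, sketched here on a made-up `toy` tibble (hypothetical values), is to cross-tabulate the original and recoded columns and confirm that every ppethm value maps to exactly one Ethm value, with no rows lost and no missing values introduced.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `rate`; values are invented for illustration.
toy <- tibble(
  ppethm = c("White, Non-Hispanic", "Hispanic", "Black, Non-Hispanic",
             "White, Non-Hispanic", "Other, Non-Hispanic")
)

# Same recode as in the post
toy_ethm <- toy %>%
  mutate(Ethm = ifelse(ppethm == "White, Non-Hispanic",
                       "non-ethnic minorities", "ethnic minorities"))

# One row per (original, recoded) pair that occurs in the data
check <- toy_ethm %>% count(ppethm, Ethm)
check

stopifnot(
  nrow(toy_ethm) == nrow(toy),             # no rows added or dropped
  !anyNA(toy_ethm$Ethm),                   # recode produced no NA
  n_distinct(check$ppethm) == nrow(check)  # each ppethm maps to one Ethm
)
```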

Time Dependent Visualization

In this section, I use ggplot to draw a bar chart of the responses to Q2, faceted by age group (ABCAGE). Even though the age range is wide, the distribution of responses is quite similar across the groups. Treating age as a stand-in for time, this suggests that people's opinions do not change much as they get older.

rate %>% 
  ggplot(aes(Q2)) + geom_bar() + theme_bw() + 
  facet_wrap(vars(ABCAGE), scales = "free")
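
Because the age groups may contain different numbers of respondents, comparing the shapes of the distributions can be easier when each panel shows within-group shares rather than raw counts. The following is a sketch of that idea under stated assumptions: `toy` is a made-up tibble, and the age labels and Q2 answers in it are invented rather than taken from the poll.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical stand-in for `rate`; labels and answers are invented.
toy <- tibble(
  ABCAGE = c("18-29", "18-29", "18-29", "65+", "65+", "65+", "65+"),
  Q2     = c("Yes", "No", "Yes", "Yes", "No", "No", "Yes")
)

# Share of each answer within its age group
q2_share <- toy %>%
  count(ABCAGE, Q2) %>%
  group_by(ABCAGE) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Shared y axis so panels are directly comparable
p_age <- ggplot(q2_share, aes(Q2, prop)) +
  geom_col() +
  theme_bw() +
  facet_wrap(vars(ABCAGE)) +
  labs(x = "Response to Q2", y = "Share of age group")
p_age
```

With a common y axis, similar bar heights across panels correspond directly to similar response distributions.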

Visualizing Part-Whole Relationships

Adding the fill aesthetic (fill = QPID) to the ggplot call colors the segments of each bar by respondents' political leaning, so each bar shows how that response to Q2 breaks down across political groups.

rate %>% 
  ggplot(aes(Q2, fill = QPID)) + geom_bar() + theme_bw() + 
  facet_wrap(vars(ABCAGE), scales = "free")
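
To emphasise the part-whole reading, the stacked bars can be rescaled so that every bar has the same height and each segment shows a share instead of a count. This is a minimal sketch using `position = "fill"` on a made-up `toy` tibble; the Q2 answers and QPID labels "A" and "B" are placeholders, not values from the poll.

```r
library(tibble)
library(ggplot2)

# Hypothetical stand-in for `rate`; answers and group labels are invented.
toy <- tibble(
  Q2   = c("Yes", "Yes", "Yes", "No", "No", "No"),
  QPID = c("A", "B", "B", "A", "A", "B")
)

# position = "fill" stacks the segments and rescales each bar to total 1
p_fill <- ggplot(toy, aes(Q2, fill = QPID)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  theme_bw() +
  labs(x = "Response to Q2", y = "Share of responses", fill = "QPID")
p_fill
```

The trade-off is that the rescaled bars hide how many respondents gave each answer, so this view complements the count-based chart rather than replacing it.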

Source Code
---
title: "Challenge 6"
author: "Lai Wei"
description: "Visualizing Time and Relationships"
date: "11/19/2022"
format:
  html:
    toc: true
    code-copy: true
    code-tools: true
categories:
  - challenge_6
  - hotel_bookings
  - air_bnb
  - fed_rate
  - debt
  - usa_households
  - abc_poll
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(ggplot2)
library(readxl)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to:

1)  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2)  tidy data (as needed, including sanity checks)
3)  mutate variables as needed (including sanity checks)
4)  create at least one graph including time (evolution)
   - try to make them "publication" ready (optional)
   - Explain why you choose the specific graph type
5)  Create at least one graph depicting part-whole or flow relationships
   - try to make them "publication" ready (optional)
   - Explain why you choose the specific graph type

[R Graph Gallery](https://r-graph-gallery.com/) is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.

(be sure to only include the category tags for the data you use!)

## Read in data

  - abc_poll ⭐⭐⭐

```{r}
rate <- read_csv("_data/abc_poll_2021.csv")
rate
```

### Briefly describe the data

## Tidy Data (as needed)

Get the column names.
```{r}
colnames(rate)
```
In this section, I build a proportion table showing the racial and ethnic composition of the survey respondents. White, Non-Hispanic respondents are by far the largest group, at about 77% of the sample.
```{r}
rate_1 <- rate %>% 
  select(xspanish, starts_with("pp")) 
prop.table(table(rate_1$ppethm))
```



```{r}
rate_ethm <- rate %>% 
  mutate(Ethm = ifelse(ppethm == "White, Non-Hispanic", "non-ethnic minorities", "ethnic minorities"))

rate_ethm
```

## Time Dependent Visualization

In this section, I use ggplot to draw a bar chart of the responses to Q2, faceted by age group (ABCAGE). Even though the age range is wide, the distribution of responses is quite similar across the groups. Treating age as a stand-in for time, this suggests that people's opinions do not change much as they get older.
```{r}
rate %>% 
  ggplot(aes(Q2)) + geom_bar() + theme_bw() + 
  facet_wrap(vars(ABCAGE), scales = "free")
```

## Visualizing Part-Whole Relationships

Adding the fill aesthetic (fill = QPID) to the ggplot call colors the segments of each bar by respondents' political leaning, so each bar shows how that response to Q2 breaks down across political groups.
```{r}
rate %>% 
  ggplot(aes(Q2, fill = QPID)) + geom_bar() + theme_bw() + 
  facet_wrap(vars(ABCAGE), scales = "free")
```