Challenge 5 Paritosh

challenge_5
railroads
cereal
air_bnb
pathogen_cost
australian_marriage
public_schools
usa_households
Introduction to Visualization
Author

Paritosh G

Published

May 27, 2023

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe it using both words and supporting information (e.g., tables)
  2. tidy the data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. create at least two univariate visualizations
     • try to make them “publication” ready
     • explain why you chose the specific graph type
  5. create at least one bivariate visualization
     • try to make them “publication” ready
     • explain why you chose the specific graph type

R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • cereal.csv ⭐
  • Total_cost_for_top_15_pathogens_2018.xlsx ⭐
  • Australian Marriage ⭐⭐
  • AB_NYC_2019.csv ⭐⭐⭐
  • StateCounty2012.xls ⭐⭐⭐
  • Public School Characteristics ⭐⭐⭐⭐
  • USA Households ⭐⭐⭐⭐⭐

# tidyverse (which provides read_csv() and ggplot2) is already loaded in the setup chunk
abn <- read_csv('_data/AB_NYC_2019.csv')
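
The challenge asks for sanity checks while tidying, so a few quick counts right after reading the file can be useful. The sketch below is only an illustration: `sanity_check_listings()` is a hypothetical helper that is not part of the original post, and it assumes the data frame has `id` and `price` columns, as the Airbnb file does.

```r
library(dplyr)

# Hypothetical helper (not from the original post): counts a few conditions
# worth inspecting before plotting. Assumes columns `id` and `price` exist.
sanity_check_listings <- function(df) {
  df %>%
    summarise(
      n_rows          = n(),                          # total listings
      n_duplicate_ids = sum(duplicated(id)),          # repeated listing ids
      n_missing_price = sum(is.na(price)),            # prices that are NA
      n_zero_price    = sum(price == 0, na.rm = TRUE) # listings priced at 0
    )
}
```

Calling `sanity_check_listings(abn)` returns a one-row tibble. Non-zero counts in the last three columns would be a reason to look more closely before summarising prices.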

Briefly describe the data

# Preview the first 15 rows
abn %>% 
  head(15)
# A tibble: 15 × 16
      id name      host_id host_…¹ neigh…² neigh…³ latit…⁴ longi…⁵ room_…⁶ price
   <dbl> <chr>       <dbl> <chr>   <chr>   <chr>     <dbl>   <dbl> <chr>   <dbl>
 1  2539 Clean & …    2787 John    Brookl… Kensin…    40.6   -74.0 Privat…   149
 2  2595 Skylit M…    2845 Jennif… Manhat… Midtown    40.8   -74.0 Entire…   225
 3  3647 THE VILL…    4632 Elisab… Manhat… Harlem     40.8   -73.9 Privat…   150
 4  3831 Cozy Ent…    4869 LisaRo… Brookl… Clinto…    40.7   -74.0 Entire…    89
 5  5022 Entire A…    7192 Laura   Manhat… East H…    40.8   -73.9 Entire…    80
 6  5099 Large Co…    7322 Chris   Manhat… Murray…    40.7   -74.0 Entire…   200
 7  5121 BlissArt…    7356 Garon   Brookl… Bedfor…    40.7   -74.0 Privat…    60
 8  5178 Large Fu…    8967 Shunic… Manhat… Hell's…    40.8   -74.0 Privat…    79
 9  5203 Cozy Cle…    7490 MaryEl… Manhat… Upper …    40.8   -74.0 Privat…    79
10  5238 Cute & C…    7549 Ben     Manhat… Chinat…    40.7   -74.0 Entire…   150
11  5295 Beautifu…    7702 Lena    Manhat… Upper …    40.8   -74.0 Entire…   135
12  5441 Central …    7989 Kate    Manhat… Hell's…    40.8   -74.0 Privat…    85
13  5803 Lovely R…    9744 Laurie  Brookl… South …    40.7   -74.0 Privat…    89
14  6021 Wonderfu…   11528 Claudio Manhat… Upper …    40.8   -74.0 Privat…    85
15  6090 West Vil…   11975 Alina   Manhat… West V…    40.7   -74.0 Entire…   120
# … with 6 more variables: minimum_nights <dbl>, number_of_reviews <dbl>,
#   last_review <date>, reviews_per_month <dbl>,
#   calculated_host_listings_count <dbl>, availability_365 <dbl>, and
#   abbreviated variable names ¹​host_name, ²​neighbourhood_group,
#   ³​neighbourhood, ⁴​latitude, ⁵​longitude, ⁶​room_type
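
The preview shows 16 variables, a mix of identifiers, text fields, coordinates, and numeric measures such as `price` and `number_of_reviews`. A compact per-column summary can support a written description. The following is a minimal sketch; `describe_columns()` is a hypothetical helper introduced here for illustration, not something from the original post.

```r
library(dplyr)

# Hypothetical helper (not from the original post): one row per column with
# its first class, the number of missing values, and the number of distinct values.
describe_columns <- function(df) {
  tibble::tibble(
    column     = names(df),
    type       = vapply(df, function(x) class(x)[1], character(1), USE.NAMES = FALSE),
    n_missing  = vapply(df, function(x) sum(is.na(x)), integer(1), USE.NAMES = FALSE),
    n_distinct = vapply(df, dplyr::n_distinct, integer(1), USE.NAMES = FALSE)
  )
}
```

Running `describe_columns(abn)` alongside `dim(abn)` gives the overall size, the type of each variable, and where missing values are concentrated.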

Univariate Visualizations

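# Mean price per neighbourhood group; geom_col() draws bars from precomputed values
abn %>% 
  group_by(neighbourhood_group) %>% 
  summarise(avg_price = mean(price, na.rm = TRUE)) %>% 
  ggplot(aes(x = neighbourhood_group, y = avg_price)) +
  geom_col() +
  labs(x = "Neighbourhood group", y = "Average price")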
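
The chart above summarises price by group, so it may help to pair it with plots of a single variable each. As a sketch under stated assumptions: a histogram suits a continuous variable such as `price`, and a log axis is used here on the assumption that prices are right-skewed; a bar chart of counts suits a categorical variable such as `room_type`. Both helper names are hypothetical and not part of the original post.

```r
library(ggplot2)

# Hypothetical helper: histogram of price on a log10 x-axis.
# Missing and non-positive prices are dropped first, since log10 is undefined for them.
plot_price_hist <- function(df, bins = 40) {
  df_pos <- df[!is.na(df$price) & df$price > 0, , drop = FALSE]
  ggplot(df_pos, aes(x = price)) +
    geom_histogram(bins = bins) +
    scale_x_log10() +
    labs(x = "Price (log scale)", y = "Number of listings")
}

# Hypothetical helper: bar chart of the number of listings per room type.
plot_room_type_counts <- function(df) {
  ggplot(df, aes(x = room_type)) +
    geom_bar() +
    labs(x = "Room type", y = "Number of listings")
}
```

With the data loaded as `abn`, the two plots would be produced by `plot_price_hist(abn)` and `plot_room_type_counts(abn)`.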

Bivariate Visualization(s)

# Scatter plot of price against review count, coloured by room type
abn %>% 
  ggplot(aes(x = number_of_reviews, y = price, color = room_type)) +
  geom_point() +
  scale_x_continuous(breaks = seq(from = 0, to = 500, by = 50)) +
  scale_y_continuous(breaks = seq(from = 0, to = 5000, by = 500))
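
A scatter plot is a natural choice for two numeric variables, but with many listings the points can overlap heavily. One possible refinement, offered as a sketch and not as the author's method, is to add transparency, put price on a log scale, and label the axes. `plot_price_vs_reviews()` is a hypothetical helper, and the `alpha` default is an arbitrary assumption.

```r
library(ggplot2)

# Hypothetical helper: price vs. number of reviews, coloured by room type.
# Missing and non-positive prices are dropped so the log10 y-axis is well defined.
plot_price_vs_reviews <- function(df, alpha = 0.3) {
  df_pos <- df[!is.na(df$price) & df$price > 0, , drop = FALSE]
  ggplot(df_pos, aes(x = number_of_reviews, y = price, color = room_type)) +
    geom_point(alpha = alpha) +   # transparency reduces overplotting
    scale_y_log10() +
    labs(x = "Number of reviews", y = "Price (log scale)", color = "Room type")
}
```

Calling `plot_price_vs_reviews(abn)` draws the plot. The log scale changes how distances on the y-axis read, so it is worth saying so in the caption.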

Any additional comments?