Challenge 5 Solution

challenge_5

air_bnb

shelton

Introduction to Visualization

Author

Dane Shelton

Published

October 24, 2022

Challenge Tasks:

Today’s challenge:

read in a data set, and describe the data set using both words and any supporting information
tidy data (as needed, including sanity checks)
mutate variables as needed (including sanity checks)
create at least two univariate visualizations

try to make them “publication” ready
Explain why you choose the specific graph type

Create at least one bivariate visualization

try to make them “publication” ready
Explain why you choose the specific graph type

R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.

Read in the following dataset:

AB_NYC_2019.csv ⭐⭐⭐

Briefly describe the data

The data provides information about New York City Airbnb listings for the year 2019. From a single observation we find a description and location of the unit, price, minimum nights, name of the host, host listings count, review info, and availability throughout the year.

To touch up the data before any other transformations, I renamed/reordered the columns appropriately and removed any observations of units that were available for less than 3 days out of the year.

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

In order to draw better visualizations from the variables, I mutated Availability and Host Listings to list them as unranked categorical variables with four levels rather than an integer variable. This will allow us to use aesthetic options like color, shape, or fill to visualize the differences between levels.

After the initial removals on the read-in, the data was nearly tidy; there were only 4 missing values, all within Description. Because it wouldn’t be reasonable to plot all the distinct descriptions, I decided to remove this variable from the plotting data, bnb_plot.

Our tidied data has an observation as one rental unit with 8 variables. bnb_plot has 30644 observations.

Let’s use bnb_plot with ggplot2 to produce visuals

Univariate Visualization 1: Listing Count by Borough

When plotting a bar graph for variable Borough, we can see where in the city our observations are located. The largest bar is Manhattan, indicating that this borough is a popular choice for owners looking to rent out property for travelers. Manhattan makes sense, it is the quintessential NYC borough despite being the smallest by land. It is also the most densely populated area of the city, of course there would be a larger market for short-term rental properties!

Univariate Visualization 2: Listing Count by Availability

Using another bar-plot to view Availability, I was initially surprised to see how many properties on Air Bnb were listed for less than a month out of the year. However, in cities with high movement like NYC, many residents will sublet their private properties even for days at a time to combat the high cost of rent.

Bivariate Visualization 1: Histogram of Price By Borough

Using geom_histogram to produce a distribution of Price, a fill aesthetic was used to visualize the distribution by Borough. We can see that the Bronx has the bulk of observations at cheaper nightly price points, but as price increases, Brooklyn and Manhattan take over the distribution.

Bivariate Visualization 2: Price by Room Type

Creating a Ridge plot of Price by Room Type with package ggridges, I then used a facet_wrap to visualize this relationship by Borough. In a trend consistent with our histogram, we see the Bronx, Queens, and Staten Island having lower price points on all three room types, with Manhattan as the most expensive Borough.

--- title: "Challenge 5 Solution" author: "Dane Shelton" description: "Introduction to Visualization" date: "10/24/2022" format: html: callout-appearance: simple df-print: paged callout-icon: false toc: true code-copy: true code-tools: true categories: - challenge_5 - air_bnb - shelton --- ```{r} #| label: setup #| include: false #| warning: false #| message: false library(tidyverse) library(ggplot2) library(ggridges) knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE) ``` ## Challenge Tasks: Today's challenge: 1. read in a data set, and describe the data set using both words and any supporting information 2. tidy data (as needed, including sanity checks) 3. mutate variables as needed (including sanity checks) 4. create at least two univariate visualizations - try to make them "publication" ready - Explain why you choose the specific graph type 5. Create at least one bivariate visualization - try to make them "publication" ready - Explain why you choose the specific graph type [R Graph Gallery](https://r-graph-gallery.com/) is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code. ::: panel-tabset ## Task 1. Read in data Read in the following dataset: - AB_NYC_2019.csv ⭐⭐⭐ ```{r} #| label: loading data #| echo: false # read in bnb_og <- readr::read_csv('_data/AB_NYC_2019.csv', show_col_types = FALSE) # removing clutter and observations of residences with less than a week of availability per year bnb_og <- bnb_og %>% select(-c('id','host_id','host_name','latitude','longitude','last_review','reviews_per_month'))%>% filter(availability_365 >= 3) # rename and relocate bnb_og <- bnb_og %>% rename("Description"=name, "Borough" = neighbourhood_group, "Neighbourhood" = neighbourhood, "Room Type" = room_type, "Price"=price, "Min Nights"=minimum_nights, "Review Count"=number_of_reviews, "Host Listings"=calculated_host_listings_count, "Availability per Year"=availability_365) %>% relocate("Availability per Year", .before = contains("Review Count")) ``` ### Briefly describe the data The data provides information about New York City Airbnb listings for the year 2019. From a single observation we find a description and location of the unit, price, minimum nights, name of the host, host listings count, review info, and availability throughout the year. To touch up the data before any other transformations, I renamed/reordered the columns appropriately and removed any observations of units that were available for less than 3 days out of the year. ## 2. Tidy Data and 3. Mutate Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here. ```{r} #| label: tidy and format #| include: true # Getting Rid of any duplicate listings bnb_og <- distinct(bnb_og) # Mutating Availability and Host Listings to plot as a 'class' var bnb_2 <- bnb_og %>% mutate("Availability" = case_when( `Availability per Year` == 365 ~ "Year-Round", `Availability per Year` > 300 ~ "Above Average (300+)", `Availability per Year` >= 180 ~ "Average (180+)", `Availability per Year` < 180 & `Availability per Year` >=70 ~ "Below Average (70+)", `Availability per Year` < 70 & `Availability per Year` >= 30 ~ "Low (30-70)", `Availability per Year` < 30 ~ "Very Low (Below 30)"), "Host List Count" = case_when(`Host Listings` > 10 ~ "Very Frequent", `Host Listings` >= 3 ~ "Frequent", `Host Listings` < 3 ~ "Low", `Host Listings` == 1 ~ "One")) bnb_plot <- bnb_2 %>% select(!contains(c('per Year', 'Host Listings', 'Description'))) head(bnb_2) ``` In order to draw better visualizations from the variables, I mutated `Availability` and `Host Listings` to list them as unranked categorical variables with four levels rather than an integer variable. This will allow us to use aesthetic options like `color`, `shape`, or `fill` to visualize the differences between levels. After the initial removals on the read-in, the data was nearly tidy; there were only 4 missing values, all within `Description`. Because it wouldn't be reasonable to plot all the distinct descriptions, I decided to remove this variable from the plotting data, `bnb_plot`. Our tidied data has an observation as one rental unit with 8 variables. `bnb_plot` has 30644 observations. ```{r} #| label: bnb_plot head #| output: true head(bnb_plot, n=5) ``` ## 4. and 5. Visualizations Let's use `bnb_plot` with `ggplot2` to produce visuals ::: {.callout-note collapse="true"} ## Univariate Visualization 1: Listing Count by `Borough` ```{r} #| label: Univariate Visualization #| include: true # Boxplot of Airbnb listings by Borough bar_boro <- ggplot(bnb_plot, aes(x=Borough, fill=Borough))+ geom_bar()+ scale_fill_brewer(palette = 'Set2')+ scale_y_continuous(breaks=seq(0,15000,by=2000))+ theme_minimal()+ theme(legend.position='none')+ labs(title = 'NYC Lisitng Count by Borough', x = 'Borough', y='Listings') bar_boro ``` When plotting a bar graph for variable `Borough`, we can see where in the city our observations are located. The largest bar is Manhattan, indicating that this borough is a popular choice for owners looking to rent out property for travelers. Manhattan makes sense, it is the quintessential NYC borough despite being the smallest by land. It is also the most densely populated area of the city, of course there would be a larger market for short-term rental properties! ::: :::{.callout-note collapse=true} ## Univariate Visualization 2: Listing Count by `Availability` ```{r} #| label: Histogram Availability #| include: true # Histogram Availability bar_avail <- bnb_plot %>% mutate(Availability_fct=factor(Availability, levels=c('Very Low (Below 30)', 'Low (30-70)', 'Below Average (70+)', 'Average (180+)', 'Above Average (300+)', 'Year-Round')))%>% ggplot(aes(x=Availability_fct, fill=Availability_fct))+ geom_bar()+ scale_fill_brewer(palette='Set2')+ scale_y_continuous(breaks=seq(0,8000,by=1000))+ theme_minimal()+ theme(legend.position='none')+ labs(title = 'NYC Lisitng Count by Availabilty', x = 'Availability (Days/Yr)', y='Listings') bar_avail ``` Using another bar-plot to view `Availability`, I was initially surprised to see how many properties on Air Bnb were listed for less than a month out of the year. However, in cities with high movement like NYC, many residents will sublet their private properties even for days at a time to combat the high cost of rent. ::: :::{.callout-note collapse=true} ## Bivariate Visualization 1: Histogram of `Price` By `Borough` ```{r} #| label: Price by Borough Hist #| include: true # Histogram of Price hist_price_boro <- ggplot(bnb_plot,aes(x=Price, fill=Borough)) + geom_histogram(binwidth = 50, position = 'stack', alpha=.8) + coord_cartesian(xlim = c(0,1200), ylim=c(0,9000)) + scale_y_continuous(breaks=seq(0,9000,by=3000)) + scale_x_continuous(breaks=seq(0,1250,by=120))+ scale_fill_brewer(palette='Set2')+ theme_minimal()+ labs(title = 'NYC Air BnB Prices', x = 'USD/Night', y='Frequency') hist_price_boro ``` Using `geom_histogram` to produce a distribution of `Price`, a `fill` aesthetic was used to visualize the distribution by `Borough`. We can see that the Bronx has the bulk of observations at cheaper nightly price points, but as price increases, Brooklyn and Manhattan take over the distribution. ::: :::{.callout-note collapse="true"} ## Bivariate Visualization 2: `Price` by `Room Type` ```{r} #| label: Ridgeline- Price x Room X Boro #| include: true ridge_room_price <- ggplot(bnb_plot, aes(x=Price, y=`Room Type`, fill=`Room Type`))+ geom_density_ridges(position='identity', alpha=0.8, bandwidth=8.5) + theme_ridges() + scale_fill_brewer(palette='Set2')+ scale_x_continuous(breaks=seq(0,600,by=300))+ coord_cartesian(xlim = c(0,1000)) + labs(title = 'NYC Air Bnb Price by Room Type', x = 'Price', y='Room Type')+ facet_wrap('Borough') ridge_room_price ``` Creating a Ridge plot of `Price` by `Room Type` with package `ggridges`, I then used a `facet_wrap` to visualize this relationship by `Borough`. In a trend consistent with our histogram, we see the Bronx, Queens, and Staten Island having lower price points on all three room types, with Manhattan as the most expensive Borough. ::: :::