Challenge 5 Solution


Categories: challenge_5, air_bnb, shelton
Description: Introduction to Visualization
Author: Dane Shelton
Published: October 24, 2022

Source Code
---
title: "Challenge 5 Solution"
author: "Dane Shelton"
description: "Introduction to Visualization"
date: "10/24/2022"
format:
  html:
    callout-appearance: simple
    df-print: paged
    callout-icon: false
    toc: true
    code-copy: true
    code-tools: true
categories:
  - challenge_5
  - air_bnb
  - shelton
---

```{r}
#| label: setup
#| include: false
#| warning: false
#| message: false

library(tidyverse)
library(ggplot2)
library(ggridges)

knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
```

## Challenge Tasks:

Today's challenge:

1.  Read in a data set, and describe the data set using both words and any supporting information
2.  Tidy data (as needed, including sanity checks)
3.  Mutate variables as needed (including sanity checks)
4.  Create at least two univariate visualizations
    - Try to make them "publication" ready
    - Explain why you chose the specific graph type
5.  Create at least one bivariate visualization
    - Try to make them "publication" ready
    - Explain why you chose the specific graph type

[R Graph Gallery](https://r-graph-gallery.com/) is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.

::: panel-tabset

## Task 1. Read in data

Read in the following dataset:

-   AB_NYC_2019.csv ⭐⭐⭐

```{r}
#| label: loading data
#| echo: false

# read in
bnb_og <- readr::read_csv('_data/AB_NYC_2019.csv', show_col_types = FALSE)

# removing clutter and observations of residences with fewer than 3 days of availability per year
bnb_og <- bnb_og %>%
            select(-c('id','host_id','host_name','latitude','longitude','last_review','reviews_per_month'))%>%
            filter(availability_365 >= 3)

# rename and relocate
bnb_og <- bnb_og %>%
            rename("Description"=name, 
                   "Borough" = neighbourhood_group, 
                   "Neighbourhood" = neighbourhood,
                   "Room Type" = room_type,
                   "Price"=price,
                   "Min Nights"=minimum_nights,
                   "Review Count"=number_of_reviews,
                   "Host Listings"=calculated_host_listings_count,
                   "Availability per Year"=availability_365) %>%
            relocate("Availability per Year", .before = contains("Review Count"))
```

### Briefly describe the data

The data provides information about New York City Airbnb listings for the year 2019. Each observation is one listing and records a description and location of the unit, its price, the minimum number of nights, the name of the host, the host's listing count, review information, and availability throughout the year.

To touch up the data before any other transformations, I dropped the identifier, host name, coordinate, and review-date columns, renamed and reordered the remaining columns, and removed any units that were available for fewer than 3 days out of the year.
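
One way to sanity-check a filter like this is to confirm that the rows kept and the rows dropped add up to the original row count. The sketch below uses a small invented tibble (`toy`, not the real file) that only shares the `availability_365` column name with the data.

```r
library(dplyr)

# Hypothetical stand-in for the raw data; the values are invented for illustration
toy <- tibble(
  id = 1:6,
  availability_365 = c(0, 2, 3, 45, 365, 1)
)

# Same rule as the read-in step: keep units available at least 3 days per year
kept <- toy %>% filter(availability_365 >= 3)

# The complement of the rule: units available fewer than 3 days per year
dropped <- toy %>% filter(availability_365 < 3)

# Sanity check: kept rows plus dropped rows should equal the original row count
nrow(kept) + nrow(dropped) == nrow(toy)
```

Note that `filter()` also drops rows where the condition is `NA`, so on real data the two counts would fall short of the total if the column had missing values.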

## 2. Tidy Data and  3. Mutate

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

```{r}
#| label: tidy and format
#| include: true

# Getting Rid of any duplicate listings
bnb_og <- distinct(bnb_og)

# Recoding Availability and Host Listings as categorical variables for plotting.
# case_when() returns the first matching condition, so the `== 1` test has to
# come before the broader `< 3` test; otherwise "One" is never assigned.

bnb_2 <- bnb_og %>%
            mutate("Availability" = case_when(
              `Availability per Year` == 365 ~ "Year-Round",
              `Availability per Year` > 300 ~ "Above Average (300+)",
              `Availability per Year` >= 180 ~ "Average (180+)",
              `Availability per Year` < 180 & `Availability per Year` >=70  ~ "Below Average (70+)",
              `Availability per Year` < 70 & `Availability per Year` >= 30  ~ "Low (30-70)",
              `Availability per Year` < 30 ~ "Very Low (Below 30)"),
                "Host List Count" = case_when(`Host Listings` > 10 ~ "Very Frequent",
                                              `Host Listings` >= 3 ~ "Frequent",
                                              `Host Listings` == 1 ~ "One",
                                              `Host Listings` < 3  ~ "Low"))

bnb_plot <- bnb_2 %>%
              select(!contains(c('per Year', 'Host Listings', 'Description')))

head(bnb_2)
```

In order to draw better visualizations from the variables, I recoded `Availability per Year` and `Host Listings` as categorical variables, `Availability` (six levels) and `Host List Count` (four levels), rather than keeping them as integers. This will allow us to use aesthetic options like `color`, `shape`, or `fill` to visualize the differences between levels.
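
`case_when()` assigns the label of the first condition that is true for each value, so narrower conditions need to be listed before broader ones. The sketch below illustrates this with an invented vector of listing counts (`host_listings` is hypothetical, not taken from the data).

```r
library(dplyr)

# Invented listing counts for illustration
host_listings <- c(1, 2, 3, 10, 11)

# Broad condition first: `< 3` also matches the value 1, so "One" is never reached
broad_first <- case_when(
  host_listings > 10 ~ "Very Frequent",
  host_listings >= 3 ~ "Frequent",
  host_listings < 3  ~ "Low",
  host_listings == 1 ~ "One"
)

# Narrow condition first: 1 is labelled "One" before the `< 3` test is evaluated
narrow_first <- case_when(
  host_listings == 1 ~ "One",
  host_listings < 3  ~ "Low",
  host_listings >= 3 & host_listings <= 10 ~ "Frequent",
  host_listings > 10 ~ "Very Frequent"
)

table(broad_first)
table(narrow_first)
```

Tabulating the result after recoding is a quick check that every intended level actually appears.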

After the initial removals at read-in, the data was nearly tidy; there were only 4 missing values, all within `Description`. Because it wouldn't be reasonable to plot all the distinct descriptions, I removed this variable from the plotting data, `bnb_plot`, along with the original `Availability per Year` and `Host Listings` columns that the new categorical variables replace.
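
A per-column count of missing values is one way to arrive at a statement like this. The following is a minimal sketch on an invented three-row tibble (`toy`), assuming only that the columns of interest may contain `NA`.

```r
library(dplyr)

# Hypothetical mini version of the listings data; values are invented
toy <- tibble(
  Description = c("Cozy loft", NA, "Sunny room"),
  Borough     = c("Brooklyn", "Queens", "Bronx"),
  Price       = c(120, 80, 60)
)

# Count missing values in every column at once
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```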

In the tidied data, each observation is one rental unit described by 8 variables. `bnb_plot` has 30,644 observations.
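
The column count can be checked after the `select()` step. The sketch below uses a one-row invented tibble (`toy`) with a few of the renamed columns to show how `contains()` with a character vector drops every column matching any of the strings.

```r
library(dplyr)

# Hypothetical one-row example using some of the renamed column names
toy <- tibble(
  `Availability per Year` = 10,
  `Host Listings`         = 1,
  Description             = "Cozy loft",
  Price                   = 50,
  Availability            = "Very Low (Below 30)"
)

# Drop every column whose name contains any of the listed strings
toy_plot <- toy %>%
  select(!contains(c("per Year", "Host Listings", "Description")))

names(toy_plot)
ncol(toy_plot)
```

`Availability` survives because its name does not contain the string "per Year", even though `Availability per Year` is removed.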

```{r}
#| label: bnb_plot head
#| output: true

head(bnb_plot, n=5)
```


## 4. and 5. Visualizations

Let's use `bnb_plot` with `ggplot2` to produce the visualizations.

::: {.callout-note collapse="true"}

## Univariate Visualization 1: Listing Count by `Borough`
```{r}
#| label: Univariate Visualization
#| include: true

# Bar chart of Airbnb listings by Borough
bar_boro <- ggplot(bnb_plot, aes(x = Borough, fill = Borough)) +
  geom_bar() +
  scale_fill_brewer(palette = 'Set2') +
  scale_y_continuous(breaks = seq(0, 15000, by = 2000)) +
  theme_minimal() +
  theme(legend.position = 'none') +
  labs(title = 'NYC Listing Count by Borough',
       x = 'Borough', y = 'Listings')

bar_boro

```
A bar chart suits `Borough` because the variable is categorical and we want to compare counts; it shows where in the city our observations are located. The tallest bar is Manhattan, indicating that this borough is a popular choice for owners looking to rent out property to travelers. This makes sense: Manhattan is the quintessential NYC borough despite being the smallest by land area, and it is also the most densely populated, so a larger market for short-term rentals is to be expected.
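
`geom_bar()` counts the rows in each category itself, so the bar heights can be cross-checked against a plain table of counts. A minimal sketch with an invented `Borough` column (`toy` is hypothetical):

```r
library(dplyr)

# Invented borough labels for illustration
toy <- tibble(
  Borough = c("Manhattan", "Brooklyn", "Manhattan",
              "Queens", "Manhattan", "Brooklyn")
)

# Tally listings per borough, largest first; these are the heights geom_bar() would draw
borough_counts <- toy %>%
  count(Borough, sort = TRUE)

borough_counts
```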

:::

:::{.callout-note collapse=true}

## Univariate Visualization 2: Listing Count by `Availability`
```{r}
#| label: Histogram Availability
#| include: true

# Bar chart of Availability, with levels ordered from lowest to highest
bar_avail <- bnb_plot %>%
  mutate(Availability_fct = factor(Availability,
                                   levels = c('Very Low (Below 30)',
                                              'Low (30-70)',
                                              'Below Average (70+)',
                                              'Average (180+)',
                                              'Above Average (300+)',
                                              'Year-Round'))) %>%
  ggplot(aes(x = Availability_fct, fill = Availability_fct)) +
  geom_bar() +
  scale_fill_brewer(palette = 'Set2') +
  scale_y_continuous(breaks = seq(0, 8000, by = 1000)) +
  theme_minimal() +
  theme(legend.position = 'none') +
  labs(title = 'NYC Listing Count by Availability',
       x = 'Availability (Days/Yr)', y = 'Listings')
bar_avail
                  

```

Using another bar chart to view `Availability`, I was initially surprised to see how many Airbnb properties were available for fewer than 30 days out of the year. A plausible explanation is that in cities with high turnover like NYC, many residents sublet their homes, even for a few days at a time, to offset the high cost of rent.
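
The bars appear in a meaningful order only because the levels are set explicitly; by default a character variable is converted to a factor with its levels sorted alphabetically. A small base R sketch with an invented vector (`avail`) shows the difference:

```r
# Invented availability labels for illustration
avail <- c("Year-Round", "Low (30-70)", "Very Low (Below 30)", "Low (30-70)")

# Default conversion: levels are sorted alphabetically
default_levels <- levels(factor(avail))

# Explicit levels: lowest availability first, highest last
avail_fct <- factor(avail,
                    levels = c("Very Low (Below 30)", "Low (30-70)", "Year-Round"))

default_levels
levels(avail_fct)
table(avail_fct)
```

`table()` and `geom_bar()` both follow the factor's level order, so the explicit ordering carries through to the plot's x-axis.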

:::

:::{.callout-note collapse=true}

## Bivariate Visualization 1: Histogram of `Price` By `Borough`

```{r}
#| label: Price by Borough Hist
#| include: true

# Stacked histogram of Price, filled by Borough

hist_price_boro <- ggplot(bnb_plot, aes(x = Price, fill = Borough)) +
  geom_histogram(binwidth = 50, position = 'stack', alpha = .8) +
  coord_cartesian(xlim = c(0, 1200), ylim = c(0, 9000)) +
  scale_y_continuous(breaks = seq(0, 9000, by = 3000)) +
  scale_x_continuous(breaks = seq(0, 1250, by = 120)) +
  scale_fill_brewer(palette = 'Set2') +
  theme_minimal() +
  labs(title = 'NYC Airbnb Prices',
       x = 'USD/Night', y = 'Frequency')
hist_price_boro
```

I used `geom_histogram` to show the distribution of `Price` and mapped `Borough` to the `fill` aesthetic to break the distribution down by borough. The Bronx's listings sit mostly at the cheaper nightly price points, while Brooklyn and Manhattan make up most of the distribution as price increases.
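
In a stacked histogram only the bottom layer shares a common baseline, so a numeric summary is a useful companion when comparing boroughs. The sketch below computes a median price per borough on invented values (`toy` is hypothetical and does not reflect the real prices).

```r
library(dplyr)

# Invented prices for two boroughs, for illustration only
toy <- tibble(
  Borough = c("Bronx", "Bronx", "Bronx",
              "Manhattan", "Manhattan", "Manhattan"),
  Price   = c(40, 60, 80, 100, 150, 400)
)

# Median nightly price and listing count per borough
price_summary <- toy %>%
  group_by(Borough) %>%
  summarise(median_price = median(Price),
            listings     = n(),
            .groups      = "drop")

price_summary
```

The median is used here rather than the mean because a few very expensive listings can pull the mean upward.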

:::

:::{.callout-note collapse="true"}

## Bivariate Visualization 2: `Price` by `Room Type` 
```{r}
#| label: Ridgeline- Price x Room X Boro
#| include: true


# Ridgeline plot of Price by Room Type, one panel per Borough
ridge_room_price <- ggplot(bnb_plot, aes(x = Price, y = `Room Type`, fill = `Room Type`)) +
  geom_density_ridges(position = 'identity', alpha = 0.8, bandwidth = 8.5) +
  theme_ridges() +
  scale_fill_brewer(palette = 'Set2') +
  scale_x_continuous(breaks = seq(0, 600, by = 300)) +
  coord_cartesian(xlim = c(0, 1000)) +
  labs(title = 'NYC Airbnb Price by Room Type',
       x = 'Price', y = 'Room Type') +
  facet_wrap('Borough')

ridge_room_price
```
I created a ridgeline plot of `Price` by `Room Type` with the `ggridges` package and used `facet_wrap` to split this relationship by `Borough`. Consistent with the histogram, the Bronx, Queens, and Staten Island show lower price points for all three room types, and Manhattan is the most expensive borough.

:::

:::