DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 7 Solutions


challenge_7
hotel_bookings
australian_marriage
air_bnb
eggs
abc_poll
faostat
usa_households
Visualizing Multiple Dimensions
Author

Caitlin Rowley

Published

November 4, 2022

library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. recreate at least two graphs from previous exercises, but introduce at least one additional dimension that you omitted before, using ggplot functionality (color, shape, line, facet, etc.). The goal is not to create unneeded chart ink (Tufte), but to concisely capture variation in additional dimensions that were collapsed in your earlier 2- or 3-dimensional graphs.
     • explain why you choose the specific graph type
  5. if you haven’t tried in previous weeks, work this week to make your graphs “publication” ready with titles, captions, pretty axis labels, and other viewer-friendly features

R Graph Gallery (https://r-graph-gallery.com/) is a good starting point for thinking about what information is conveyed in standard graph types, and it includes example R code. Anyone not familiar with Edward Tufte should check out his fantastic books and courses on data visualization.

(be sure to only include the category tags for the data you use!)

I am going to work with my data set for my final project, which is a data set from AirBnB.

# read in data (assumes "Boston AirBnB Data.csv" sits in the working directory,
# e.g. the same folder as this post; a hard-coded setwd() path only works on one machine):

Boston <- read_csv("Boston AirBnB Data.csv")

The data come from the Inside AirBnB website and contain summary information and metrics for listings in Boston, MA.

Tidy Data (as needed):

# look for duplicates
# look for missing values
# look for outliers - review how to clean
# remember na.rm=TRUE for calculations

# clean the data:

# At first glance, the column titled "neighbourhood_group" seems to contain no values. I will list its unique values to determine whether it can be removed from my tidy data set.

unique(Boston[c("neighbourhood_group")])

# I now know that there is no data within this column. I will remove it from my data set.

Boston_tidy <- subset(Boston, select = -c(neighbourhood_group))

# Viewing this data frame shows no other columns that are entirely empty, so I will move on to other tidying tasks.

# rename columns (names are assigned by position, so the order must match the columns of Boston_tidy):

names(Boston_tidy) <- c('room_id', 'room_name', 'host_id', 'host_name', 'neighborhood', 'room_latitude', 'room_longitude', 'room_type', 'room_price', 'min_nights', 'number_reviews', 'last_review', 'reviews_per_month', 'host_listings', 'availability_next_365', 'number_reviews_LTM', 'room_license')

# find duplicates:

duplicates <- duplicated(Boston_tidy)

# duplicates is a logical vector with one element per row, so I will summarize it rather than print it.
# (Indexing with duplicates["TRUE"] looks for an element named "TRUE" and returns NA whether or not duplicates exist.)

any(duplicates)  # FALSE means no row is an exact copy of an earlier row
sum(duplicates)  # number of duplicated rows

“Boston_tidy” represents AirBnB rental listing data for the city of Boston over the last twelve months. The data frame has 17 variables and 5,185 rows. Each row represents one unique observation (in this case, a unique rental listing) and includes the following variables: (1) room/listing ID number, (2) name of the room/listing, (3) listing host ID number, (4) listing host name, (5) room/listing neighborhood, (6) room/listing latitude, (7) room/listing longitude, (8) type of room/listing, (9) room/listing price, (10) minimum number of nights for rent, (11) number of room/listing reviews, (12) date of the most recent room/listing review, (13) number of room/listing reviews per month, (14) number of listings held by the host, (15) room/listing availability over the next year, (16) number of reviews for the room/listing over the past 12 months, and (17) room/listing licensure status.
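
The two tidying checks used above (an entirely empty column and duplicated rows) can be illustrated on a small example. This is only a sketch: `toy` is a hypothetical data frame, not the Boston data, and it uses base R only. It also shows what happens when a logical vector is indexed with the string "TRUE".

```r
# Hypothetical toy data: row 3 repeats row 1, and column "empty" holds no values.
toy <- data.frame(
  id    = c(1, 2, 1),
  price = c(100, 150, 100),
  empty = c(NA, NA, NA)
)

# Columns in which every value is missing
all_na <- colSums(is.na(toy)) == nrow(toy)
names(toy)[all_na]

# Drop the empty column(s), then flag rows that repeat an earlier row
toy_tidy <- toy[, !all_na, drop = FALSE]
dups <- duplicated(toy_tidy)

dups["TRUE"]  # NA: this looks for an element *named* "TRUE", and dups has no names
any(dups)     # TRUE here, because row 3 repeats row 1
sum(dups)     # number of repeated rows

# Keep only the first occurrence of each row
toy_unique <- toy_tidy[!dups, , drop = FALSE]
```

Because `dups["TRUE"]` is NA even when duplicates are present, `any()` and `sum()` are the safer way to read the result of `duplicated()`.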

I will now generate summary statistics:

# summary statistics for numeric variables:

Boston_tidy %>% 
  select(room_price, min_nights, number_reviews, host_listings, availability_next_365) %>% 
  summary()

Mutate

First, I will combine the latitude and longitude variables into a single “room_coordinates” variable.

# combine lat and lon to create "room_coordinates"
# keep lat and lon columns for now

Boston_mutate <- Boston_tidy %>%
  mutate(room_coordinates = paste(room_latitude, room_longitude))
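
The challenge asks for sanity checks after mutating. A minimal sketch of what such a check could look like follows. It assumes a hypothetical two-row tibble, `toy`, rather than the Boston data, and the `sep = ", "` separator is my own choice for readability. One thing worth checking is missing inputs, because `paste()` turns a missing value into the text "NA" instead of propagating it.

```r
library(dplyr)

# Hypothetical toy listings (not the Boston data)
toy <- tibble(
  room_latitude  = c(42.36, 42.35),
  room_longitude = c(-71.06, -71.08)
)

toy_mutate <- toy %>%
  mutate(room_coordinates = paste(room_latitude, room_longitude, sep = ", "))

# Sanity checks: the row count is unchanged and neither input has missing values
stopifnot(
  nrow(toy_mutate) == nrow(toy),
  !anyNA(toy$room_latitude),
  !anyNA(toy$room_longitude)
)

toy_mutate$room_coordinates
```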

Next, I will generate a new data set that contains a variable for median room prices:

# find median room prices by neighborhood and room type:

Boston_median <- Boston_mutate %>%
  filter(room_price > 0) %>%
  group_by(room_type, neighborhood) %>%
  summarize(median_price = median(room_price, na.rm = TRUE), .groups = "drop")
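
To make the grouped summary concrete, here is a small sketch on hypothetical data (`toy` and its values are made up for illustration). It adds a count per group, which is not in the code above, so that the result can be checked against the number of rows that passed the filter.

```r
library(dplyr)

# Hypothetical toy listings (not the Boston data)
toy <- tibble(
  neighborhood = c("A", "A", "A", "B", "B"),
  room_type    = c("Private room", "Private room", "Entire home/apt",
                   "Private room", "Private room"),
  room_price   = c(100, 200, 300, 50, 70)
)

toy_median <- toy %>%
  filter(room_price > 0) %>%
  group_by(room_type, neighborhood) %>%
  summarize(
    median_price = median(room_price),
    n_listings   = n(),
    .groups      = "drop"
  )

# Sanity check: group counts add back up to the rows that passed the filter
stopifnot(sum(toy_median$n_listings) == sum(toy$room_price > 0))

toy_median
```

Keeping the count next to the median also shows which medians rest on only one or two listings.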

Visualization with Multiple Dimensions

I will next generate visualizations with multiple dimensions.

# bar chart of room prices by neighborhood
# (geom_col() stacks every listing, so each bar's height is the sum of that neighborhood's prices)
library(RColorBrewer)

ggplot(Boston_tidy, aes(x = neighborhood, y = room_price, fill = neighborhood)) +
  geom_col() +
  scale_fill_hue() +
  theme_classic() +
  # element_text() from ggplot2 rotates the labels without requiring the ggtext package
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

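This is interesting, but it is not a great way to visualize these data: each bar stacks the price of every listing, so its height reflects both how many listings a neighborhood has and how expensive they are, and the distribution of individual prices is hidden. I will next generate a geom_point plot that omits outliers and contains a boxplot overlay.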
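
As a side note on the bar chart: if bars are wanted, one option is to summarize each neighborhood to a single value before drawing. The sketch below is an assumption about what a more readable bar chart could look like, not something the analysis above does. It uses a hypothetical `toy` data frame and `stat_summary()` with `fun = median`.

```r
library(ggplot2)

# Hypothetical toy listings (not the Boston data)
toy <- data.frame(
  neighborhood = c("A", "A", "A", "B", "B"),
  room_price   = c(100, 200, 300, 50, 70)
)

# geom_col() would stack the rows, giving bar heights of 600 (A) and 120 (B).
# stat_summary() reduces each neighborhood to its median before drawing the bar.
p_median <- ggplot(toy, aes(x = neighborhood, y = room_price)) +
  stat_summary(fun = median, geom = "col") +
  labs(x = "Neighborhood", y = "Median price per night")

# Inspect the values that will be drawn: one median per neighborhood
layer_data(p_median)$y
```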

# remove outliers (values more than 1.5 * IQR beyond the quartiles of all room prices):

is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr
}

Boston_trim <- Boston_mutate %>%
  filter(!is_outlier(room_price))

# preview the listings priced above 0 and below 600
# (this result is only printed; Boston_trim itself is not changed):

Boston_trim %>%
  filter(room_price > 0, room_price < 600) %>%
  group_by(room_type, neighborhood)
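
The outlier rule above is applied to all room prices at once. Because the plots that follow compare neighborhoods, it may be worth seeing how the result changes when the same 1.5 * IQR rule is applied within each neighborhood. The sketch below uses hypothetical data (`toy`) and is only an illustration of that design choice, not a claim about which version suits the Boston data.

```r
library(dplyr)

# Same 1.5 * IQR rule as above, written to tolerate missing values
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

# Hypothetical toy listings: neighborhood A is cheap apart from one listing at 100
toy <- tibble(
  neighborhood = rep(c("A", "B"), times = c(6, 5)),
  room_price   = c(10, 11, 12, 12, 13, 100, 200, 210, 220, 230, 240)
)

# Rule applied to all prices at once: the gap between A and B widens the IQR,
# so the listing at 100 is not flagged
trim_overall <- toy %>%
  filter(!is_outlier(room_price))

# Rule applied within each neighborhood: 100 is flagged relative to the other A listings
trim_by_group <- toy %>%
  group_by(neighborhood) %>%
  filter(!is_outlier(room_price)) %>%
  ungroup()

nrow(trim_overall)
nrow(trim_by_group)
```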
# generate geom_point chart with a boxplot overlay

Boston_trim %>%
  ggplot(aes(x = neighborhood, y = room_price)) +
  geom_point(alpha = .08, size = 5, color = "light pink") +
  # fill = NA keeps the boxes transparent so the points underneath stay visible
  geom_boxplot(fill = NA) +
  labs(x = "Neighborhood", y = "Price Per Night", title = "Boston Airbnb Rental Prices by Neighborhood") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

I will next generate a violin plot that also omits outliers:

# generate violin plot:

Boston_trim%>%
  ggplot(aes(x=room_type, y=room_price, fill=neighborhood)) + 
  geom_violin(width = 1, size = .5, 
              scale = "area",
              trim = TRUE)+
  coord_flip()+
  labs(x="Room Type", y="Room Price", title = "Price by Room Type and Neighborhood")

Given the number of neighborhoods, this is extremely difficult to read. I will use facet wrapping to adjust the visualization.

# violin plot with facet wrapping:

Boston_trim %>% 
  ggplot(aes(x=neighborhood, y=room_price, fill=room_type)) + 
  geom_violin(width = 1, size = .5, 
              scale = "area",
              trim = FALSE)+
  facet_wrap(vars(neighborhood), scales = "free")+
  theme_bw()+
  labs(x="Neighborhood", y = "Price", fill= "Room Type", title = "Boston Airbnb Prices by Neighborhood and Room Type", subtitle= "2021")

I want to figure out how to adjust the scale of the violin plots themselves so that I can read the data. I have tried adjusting the scale in the facet wrapping code chunk, but that hasn't worked so far. I will continue working on this.
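
One thing that might help here is the `scale` argument of `geom_violin()` itself, rather than the `scales` argument of `facet_wrap()`. With `scale = "width"`, every violin is drawn with the same maximum width, so groups with few listings are not shrunk to thin slivers. The sketch below is only an illustration under assumptions: it uses simulated data (`toy`), and it puts room type on the x-axis inside each neighborhood panel, which differs from the plot above. Whether it fixes the Boston plot is untested.

```r
library(ggplot2)

set.seed(601)

# Hypothetical toy data: two neighborhoods, two room types, 40 listings per combination
toy <- expand.grid(
  listing      = 1:40,
  room_type    = c("Entire home/apt", "Private room"),
  neighborhood = c("A", "B")
)
toy$room_price <- round(rnorm(
  nrow(toy),
  mean = ifelse(toy$room_type == "Private room", 90, 180),
  sd   = 25
))

p_violin <- ggplot(toy, aes(x = room_type, y = room_price, fill = room_type)) +
  # scale = "width": each violin is stretched to the same maximum width
  geom_violin(scale = "width", trim = TRUE) +
  facet_wrap(vars(neighborhood)) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Room Type", y = "Price",
       title = "Toy example: price by room type, one panel per neighborhood")

p_violin
```

The trade-off is that equal-width violins no longer show which groups have more listings, so counts would need to be reported separately.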

I will next generate a tree map. It won't be helpful in terms of visualizing prices, but I wanted to give it a shot as practice.

library(treemap)
library(RColorBrewer)

# treemap() takes the data frame first, followed by named arguments.
# vSize = "room_price" sizes each tile by the sum of room prices in that group.
treemap(Boston_tidy,
        index = c("neighborhood", "room_type"),
        vSize = "room_price",
        type = "index",
        palette = "PiYG",
        title = "Total Listing Price by Neighborhood and Room Type, Boston Airbnb",
        overlap.labels = 0)

# treemap() does not return a ggplot object, so a ggplot2 scale such as scale_color_brewer()
# cannot be added with "+"; a Brewer palette is passed through the palette argument instead.