Challenge 6

challenge_6

air_bnb

tidyverse

ggplot2

summarytools

ggridges

Visualizing Time and Relationships

Author

Saaradhaa M

Published

August 23, 2022

library(tidyverse)
library(ggplot2)
library(summarytools)
library(ggridges)

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)


Read in data

I’ll be working with the Airbnb dataset, since I’ve already worked on several of the other datasets on the list for today’s challenge.

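# read in the NYC Airbnb listings.
airbnb <- read_csv("_data/AB_NYC_2019.csv", show_col_types = FALSE)
# complete() is called without any columns, so the data comes back unchanged
# (rows with missing values are kept, as the summary below shows).
airbnb <- complete(airbnb)
# render an HTML summary of every column.
print(dfSummary(airbnb, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.30, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')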

Data Frame Summary

airbnb

Dimensions: 48895 x 16
Duplicates: 0

| Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|
| id [numeric] | Mean (sd): 19017143 (10983108); min ≤ med ≤ max: 2539 ≤ 19677284 ≤ 36487245; IQR (CV): 19680234 (0.6) | 48895 distinct values | 0 (0.0%) |
| name [character] | 1. Hillside Hotel; 2. Home away from home; 3. New york Multi-unit build; 4. Brooklyn Apartment; 5. Loft Suite @ The Box Hous; 6. Private Room; 7. Artsy Private BR in Fort; 8. Private room; 9. Beautiful Brooklyn Browns; 10. Cozy Brooklyn Apartment; [ 47884 others ] | 18 (0.0%); 17 (0.0%); 16 (0.0%); 12 (0.0%); 11 (0.0%); 11 (0.0%); 10 (0.0%); 10 (0.0%); 8 (0.0%); 8 (0.0%); 48758 (99.8%) | 16 (0.0%) |
| host_id [numeric] | Mean (sd): 67620011 (78610967); min ≤ med ≤ max: 2438 ≤ 30793816 ≤ 274321313; IQR (CV): 99612390 (1.2) | 37457 distinct values | 0 (0.0%) |
| host_name [character] | 1. Michael; 2. David; 3. Sonder (NYC); 4. John; 5. Alex; 6. Blueground; 7. Sarah; 8. Daniel; 9. Jessica; 10. Maria; [ 11442 others ] | 417 (0.9%); 403 (0.8%); 327 (0.7%); 294 (0.6%); 279 (0.6%); 232 (0.5%); 227 (0.5%); 226 (0.5%); 205 (0.4%); 204 (0.4%); 46060 (94.2%) | 21 (0.0%) |
| neighbourhood_group [character] | 1. Bronx; 2. Brooklyn; 3. Manhattan; 4. Queens; 5. Staten Island | 1091 (2.2%); 20104 (41.1%); 21661 (44.3%); 5666 (11.6%); 373 (0.8%) | 0 (0.0%) |
| neighbourhood [character] | 1. Williamsburg; 2. Bedford-Stuyvesant; 3. Harlem; 4. Bushwick; 5. Upper West Side; 6. Hell's Kitchen; 7. East Village; 8. Upper East Side; 9. Crown Heights; 10. Midtown; [ 211 others ] | 3920 (8.0%); 3714 (7.6%); 2658 (5.4%); 2465 (5.0%); 1971 (4.0%); 1958 (4.0%); 1853 (3.8%); 1798 (3.7%); 1564 (3.2%); 1545 (3.2%); 25449 (52.0%) | 0 (0.0%) |
| latitude [numeric] | Mean (sd): 40.7 (0.1); min ≤ med ≤ max: 40.5 ≤ 40.7 ≤ 40.9; IQR (CV): 0.1 (0) | 19048 distinct values | 0 (0.0%) |
| longitude [numeric] | Mean (sd): -74 (0); min ≤ med ≤ max: -74.2 ≤ -74 ≤ -73.7; IQR (CV): 0 (0) | 14718 distinct values | 0 (0.0%) |
| room_type [character] | 1. Entire home/apt; 2. Private room; 3. Shared room | 25409 (52.0%); 22326 (45.7%); 1160 (2.4%) | 0 (0.0%) |
| price [numeric] | Mean (sd): 152.7 (240.2); min ≤ med ≤ max: 0 ≤ 106 ≤ 10000; IQR (CV): 106 (1.6) | 674 distinct values | 0 (0.0%) |
| minimum_nights [numeric] | Mean (sd): 7 (20.5); min ≤ med ≤ max: 1 ≤ 3 ≤ 1250; IQR (CV): 4 (2.9) | 109 distinct values | 0 (0.0%) |
| number_of_reviews [numeric] | Mean (sd): 23.3 (44.6); min ≤ med ≤ max: 0 ≤ 5 ≤ 629; IQR (CV): 23 (1.9) | 394 distinct values | 0 (0.0%) |
| last_review [Date] | min: 2011-03-28; med: 2019-05-19; max: 2019-07-08; range: 8y 3m 10d | 1764 distinct values | 10052 (20.6%) |
| reviews_per_month [numeric] | Mean (sd): 1.4 (1.7); min ≤ med ≤ max: 0 ≤ 0.7 ≤ 58.5; IQR (CV): 1.8 (1.2) | 937 distinct values | 10052 (20.6%) |
| calculated_host_listings_count [numeric] | Mean (sd): 7.1 (33); min ≤ med ≤ max: 1 ≤ 1 ≤ 327; IQR (CV): 1 (4.6) | 47 distinct values | 0 (0.0%) |
| availability_365 [numeric] | Mean (sd): 112.8 (131.6); min ≤ med ≤ max: 0 ≤ 45 ≤ 365; IQR (CV): 227 (1.2) | 366 distinct values | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-23

Briefly describe and tidy data

There are 16 columns and 48,895 rows. The dataset describes Airbnb listings in NYC in 2019. Each row represents one listing, identified by a unique id, along with its name, host ID and host name, location, room type, price and review information. I’ll tidy the dataset by converting neighbourhood_group, neighbourhood and room_type to factors.

# changing 3 columns to factors.
airbnb <- airbnb %>%
  mutate(neighbourhood_group = as.factor(neighbourhood_group),
         neighbourhood = as.factor(neighbourhood),
         room_type = as.factor(room_type))

# sanity check.
airbnb
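The factor conversion above can also be written more compactly with across(). The following is a minimal sketch, assuming dplyr 1.0 or later; the `toy` tibble and its values are hypothetical stand-ins for the real data, used only so the example runs on its own.

```r
library(dplyr)

# Toy stand-in for the airbnb data (hypothetical values, not real listings).
toy <- tibble(
  neighbourhood_group = c("Manhattan", "Bronx", "Manhattan"),
  neighbourhood       = c("Harlem", "Fordham", "Midtown"),
  room_type           = c("Private room", "Shared room", "Entire home/apt"),
  price               = c(80, 40, 200)
)

# Convert the three categorical columns to factors in one step.
toy <- toy %>%
  mutate(across(c(neighbourhood_group, neighbourhood, room_type), as.factor))

# Check the resulting column classes and the levels of one factor.
sapply(toy, class)
levels(toy$room_type)
```

Using across() keeps the list of columns in one place, which may be handy if more categorical columns need converting later.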

Time Dependent Visualization

I want to visualize how last_review differs across neighbourhood_group - I’ll try out a histogram and a ridgeline plot.

# histogram - filter out NA values.
airbnb %>% filter(! is.na(last_review)) %>%
  filter(! is.na(neighbourhood_group)) %>%
  ggplot(aes(last_review, fill = neighbourhood_group)) + 
  geom_histogram(aes(y = after_stat(density)), alpha = 0.2, binwidth = 200) +
  geom_density(alpha = 0.9) +
  theme_minimal() + 
  labs(title = "Last Review Date of Airbnb Listings in NYC (2019)") +
  facet_wrap(vars(neighbourhood_group)) +
  theme(axis.text.x=element_text(angle=90,hjust=1)) +
  scale_x_date(date_labels = "%m-%Y")

# ridgeline - filter out NA values.
airbnb %>% filter(! is.na(last_review)) %>%
  filter(! is.na(neighbourhood_group)) %>%
  ggplot(aes(last_review, neighbourhood_group, fill = neighbourhood_group)) +
  geom_density_ridges_gradient(scale = 2, rel_min_height = 0.02) +
  labs(title = "Last Review Date of Airbnb Listings in NYC (2019)") +
  theme_minimal()

The ridgeline plot is easier to read, because it stacks the last_review distribution for each neighbourhood_group on a shared date axis (credit for the packages and code is here). Reviews for the listings tend to be quite recent across all neighbourhood groups. Manhattan has the oldest reviews, while Staten Island has the newest reviews.
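A visual reading like this can be cross-checked with a small summary table of review dates per group. The sketch below is only an illustration: the `toy` tibble holds made-up dates, and the column names `earliest`, `median_date` and `latest` are my own choices, not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for the airbnb data (hypothetical dates, not real listings).
toy <- tibble(
  neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan",
                          "Staten Island", "Staten Island", "Staten Island"),
  last_review = as.Date(c("2015-06-01", "2018-11-20", "2019-06-30",
                          "2019-05-02", "2019-06-15", "2019-07-01"))
)

# Earliest, median and latest review date for each neighbourhood group.
review_summary <- toy %>%
  filter(!is.na(last_review)) %>%
  group_by(neighbourhood_group) %>%
  summarise(
    earliest    = min(last_review),
    median_date = median(last_review),
    latest      = max(last_review),
    .groups     = "drop"
  )

review_summary
```

Running the same pipeline on the real airbnb data would show whether the numbers agree with what the ridgeline plot suggests.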

Let’s see if a time series graph might also work (this time with reviews_per_month).

# time series - filter out NA values.
airbnb %>% filter(! is.na(last_review)) %>%
  ggplot(aes(x = last_review, y = reviews_per_month)) +
  geom_line() +
  labs(title = "Reviews per Month by Last Review Date, Airbnb Listings in NYC (2019)") +
  theme_minimal()

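Generally, listings with a more recent last_review tend to have a higher reviews_per_month, and the values shoot up for listings last reviewed in 2019. Note that geom_line() connects individual listings in date order, so this is not an aggregated time series.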
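One way to get a smoother trend line is to aggregate listings by the month of their last review before plotting. This is a sketch under assumptions rather than the approach used above: the `toy` tibble contains made-up values, and the median is just one reasonable choice of summary.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the airbnb data (hypothetical values, not real listings).
toy <- tibble(
  last_review = as.Date(c("2019-01-05", "2019-01-20", "2019-02-11",
                          "2019-02-25", "2019-03-03")),
  reviews_per_month = c(0.5, 1.5, 1.0, 3.0, 2.5)
)

# Collapse each date to the first day of its month, then take the median
# reviews_per_month of the listings last reviewed in that month.
monthly <- toy %>%
  filter(!is.na(last_review), !is.na(reviews_per_month)) %>%
  mutate(review_month = as.Date(format(last_review, "%Y-%m-01"))) %>%
  group_by(review_month) %>%
  summarise(median_rpm = median(reviews_per_month), .groups = "drop")

# One point per month, joined by a line.
p <- ggplot(monthly, aes(x = review_month, y = median_rpm)) +
  geom_line() +
  theme_minimal() +
  labs(x = "Month of last review", y = "Median reviews per month")
p
```

With one value per month, the line no longer zigzags between individual listings that share similar dates.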

Visualizing Part-Whole Relationships

I want to compare the differences in room_type between the priciest and cheapest neighbourhood_group (on average) using pie charts.

# group by neighbourhood_group, then calculate average airbnb price and find priciest neighbourhood_group.
airbnb %>% group_by(neighbourhood_group) %>% 
  summarise(mean_price = mean(price, na.rm = TRUE)) %>% 
  arrange(desc(mean_price)) %>% slice(1)

# find cheapest neighbourhood_group.
airbnb %>% group_by(neighbourhood_group) %>% 
  summarise(mean_price = mean(price, na.rm = TRUE)) %>% 
  arrange(mean_price) %>% slice(1)

# create subset: count listings per room_type in Manhattan and the Bronx.
# count() returns one row per combination, so no de-duplication is needed.
sub <- airbnb %>% 
  filter(neighbourhood_group %in% c("Manhattan", "Bronx")) %>% 
  count(room_type, neighbourhood_group)

Ok - the priciest neighbourhood_group is Manhattan, and the cheapest is the Bronx. We’ve also created a subset of the data that we need. Now let’s make the pie charts.
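The priciest and cheapest groups can also be pulled from a single summary with slice_max() and slice_min(). This is a minimal sketch, assuming dplyr 1.0 or later; the `toy` prices are hypothetical and chosen only to make the example runnable.

```r
library(dplyr)

# Toy stand-in for the airbnb data (hypothetical prices, not real listings).
toy <- tibble(
  neighbourhood_group = c("Manhattan", "Manhattan", "Bronx", "Bronx", "Queens"),
  price = c(200, 180, 80, 90, 100)
)

# Compute the mean price per group once, then reuse it.
mean_prices <- toy %>%
  group_by(neighbourhood_group) %>%
  summarise(mean_price = mean(price, na.rm = TRUE), .groups = "drop")

# Highest and lowest mean price.
priciest <- slice_max(mean_prices, mean_price, n = 1)
cheapest <- slice_min(mean_prices, mean_price, n = 1)

priciest
cheapest
```

Computing the summary once avoids repeating the group_by() and summarise() steps for each lookup.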

# creating pie chart 1.
sub_m <- sub %>% filter(`neighbourhood_group` == "Manhattan")
ggplot(sub_m, aes(x="", y=n, fill=room_type)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) +
  theme_void() + 
  scale_fill_brewer(palette="Set2") +
  labs(title = "Airbnb Room Types in Manhattan (2019)")

# creating pie chart 2.
sub_b <- sub %>% filter(`neighbourhood_group` == "Bronx")
ggplot(sub_b, aes(x="", y=n, fill=room_type)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) +
  theme_void() +
  scale_fill_brewer(palette="Set2") + 
  labs(title = "Airbnb Room Types in the Bronx (2019)")

In Manhattan, listings for entire homes/apartments were most common, while listings in the Bronx were overwhelmingly for private rooms. Shared rooms seem to be the least common in both neighbourhood groups.
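Because the two groups have very different numbers of listings, comparing shares rather than raw counts makes the pies directly comparable, and faceting puts them side by side. The sketch below assumes a count table shaped like `sub`; the `sub_toy` counts are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the sub table (hypothetical counts, not real data).
sub_toy <- tibble(
  neighbourhood_group = rep(c("Manhattan", "Bronx"), each = 3),
  room_type = rep(c("Entire home/apt", "Private room", "Shared room"), times = 2),
  n = c(60, 35, 5, 30, 60, 10)
)

# Convert counts to within-group shares so each pie sums to 1.
shares <- sub_toy %>%
  group_by(neighbourhood_group) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# One pie per neighbourhood group, drawn in a single faceted plot.
p <- ggplot(shares, aes(x = "", y = share, fill = room_type)) +
  geom_col(width = 1) +
  coord_polar("y", start = 0) +
  facet_wrap(vars(neighbourhood_group)) +
  theme_void() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Room type shares by neighbourhood group (toy data)")
p
```

Plotting raw counts in a faceted pie would leave the smaller group's circle incomplete on a shared scale, which is why shares are used here.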