Challenge 6

challenge_6
Aritra Basu
air_bnb
Visualizing Time and Relationships
Author

Aritra Basu

Published

August 23, 2022

library(tidyverse)
library(readr)
library(summarytools)
library(ggplot2)
library(knitr)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Reading in the data

mydata <- read.csv("_data/AB_NYC_2019.csv")
glimpse(mydata)
Rows: 48,895
Columns: 16
$ id                             <int> 2539, 2595, 3647, 3831, 5022, 5099, 512…
$ name                           <chr> "Clean & quiet apt home by the park", "…
$ host_id                        <int> 2787, 2845, 4632, 4869, 7192, 7322, 735…
$ host_name                      <chr> "John", "Jennifer", "Elisabeth", "LisaR…
$ neighbourhood_group            <chr> "Brooklyn", "Manhattan", "Manhattan", "…
$ neighbourhood                  <chr> "Kensington", "Midtown", "Harlem", "Cli…
$ latitude                       <dbl> 40.64749, 40.75362, 40.80902, 40.68514,…
$ longitude                      <dbl> -73.97237, -73.98377, -73.94190, -73.95…
$ room_type                      <chr> "Private room", "Entire home/apt", "Pri…
$ price                          <int> 149, 225, 150, 89, 80, 200, 60, 79, 79,…
$ minimum_nights                 <int> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1, 5, 2, 4…
$ number_of_reviews              <int> 9, 45, 0, 270, 9, 74, 49, 430, 118, 160…
$ last_review                    <chr> "2018-10-19", "2019-05-21", "", "2019-0…
$ reviews_per_month              <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.59, 0.40,…
$ calculated_host_listings_count <int> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 3, …
$ availability_365               <int> 365, 355, 365, 194, 0, 129, 0, 220, 0, …
View(mydata)

Briefly describing the data

print(summarytools::dfSummary(mydata,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

mydata

Dimensions: 48895 x 16
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Missing
id [integer]
Mean (sd) : 19017143 (10983108)
min ≤ med ≤ max:
2539 ≤ 19677284 ≤ 36487245
IQR (CV) : 19680234 (0.6)
48895 distinct values 0 (0.0%)
name [character]
1. Hillside Hotel
2. Home away from home
3. (Empty string)
4. New york Multi-unit build
5. Brooklyn Apartment
6. Loft Suite @ The Box Hous
7. Private Room
8. Artsy Private BR in Fort
9. Private room
10. Beautiful Brooklyn Browns
[ 47896 others ]
18 ( 0.0% )
17 ( 0.0% )
16 ( 0.0% )
16 ( 0.0% )
12 ( 0.0% )
11 ( 0.0% )
11 ( 0.0% )
10 ( 0.0% )
10 ( 0.0% )
8 ( 0.0% )
48766 ( 99.7% )
0 (0.0%)
host_id [integer]
Mean (sd) : 67620011 (78610967)
min ≤ med ≤ max:
2438 ≤ 30793816 ≤ 274321313
IQR (CV) : 99612390 (1.2)
37457 distinct values 0 (0.0%)
host_name [character]
1. Michael
2. David
3. Sonder (NYC)
4. John
5. Alex
6. Blueground
7. Sarah
8. Daniel
9. Jessica
10. Maria
[ 11443 others ]
417 ( 0.9% )
403 ( 0.8% )
327 ( 0.7% )
294 ( 0.6% )
279 ( 0.6% )
232 ( 0.5% )
227 ( 0.5% )
226 ( 0.5% )
205 ( 0.4% )
204 ( 0.4% )
46081 ( 94.2% )
0 (0.0%)
neighbourhood_group [character]
1. Bronx
2. Brooklyn
3. Manhattan
4. Queens
5. Staten Island
1091 ( 2.2% )
20104 ( 41.1% )
21661 ( 44.3% )
5666 ( 11.6% )
373 ( 0.8% )
0 (0.0%)
neighbourhood [character]
1. Williamsburg
2. Bedford-Stuyvesant
3. Harlem
4. Bushwick
5. Upper West Side
6. Hell's Kitchen
7. East Village
8. Upper East Side
9. Crown Heights
10. Midtown
[ 211 others ]
3920 ( 8.0% )
3714 ( 7.6% )
2658 ( 5.4% )
2465 ( 5.0% )
1971 ( 4.0% )
1958 ( 4.0% )
1853 ( 3.8% )
1798 ( 3.7% )
1564 ( 3.2% )
1545 ( 3.2% )
25449 ( 52.0% )
0 (0.0%)
latitude [numeric]
Mean (sd) : 40.7 (0.1)
min ≤ med ≤ max:
40.5 ≤ 40.7 ≤ 40.9
IQR (CV) : 0.1 (0)
19048 distinct values 0 (0.0%)
longitude [numeric]
Mean (sd) : -74 (0)
min ≤ med ≤ max:
-74.2 ≤ -74 ≤ -73.7
IQR (CV) : 0 (0)
14718 distinct values 0 (0.0%)
room_type [character]
1. Entire home/apt
2. Private room
3. Shared room
25409 ( 52.0% )
22326 ( 45.7% )
1160 ( 2.4% )
0 (0.0%)
price [integer]
Mean (sd) : 152.7 (240.2)
min ≤ med ≤ max:
0 ≤ 106 ≤ 10000
IQR (CV) : 106 (1.6)
674 distinct values 0 (0.0%)
minimum_nights [integer]
Mean (sd) : 7 (20.5)
min ≤ med ≤ max:
1 ≤ 3 ≤ 1250
IQR (CV) : 4 (2.9)
109 distinct values 0 (0.0%)
number_of_reviews [integer]
Mean (sd) : 23.3 (44.6)
min ≤ med ≤ max:
0 ≤ 5 ≤ 629
IQR (CV) : 23 (1.9)
394 distinct values 0 (0.0%)
last_review [character]
1. (Empty string)
2. 2019-06-23
3. 2019-07-01
4. 2019-06-30
5. 2019-06-24
6. 2019-07-07
7. 2019-07-02
8. 2019-06-22
9. 2019-06-16
10. 2019-07-05
[ 1755 others ]
10052 ( 20.6% )
1413 ( 2.9% )
1359 ( 2.8% )
1341 ( 2.7% )
875 ( 1.8% )
718 ( 1.5% )
658 ( 1.3% )
655 ( 1.3% )
601 ( 1.2% )
580 ( 1.2% )
30643 ( 62.7% )
0 (0.0%)
reviews_per_month [numeric]
Mean (sd) : 1.4 (1.7)
min ≤ med ≤ max:
0 ≤ 0.7 ≤ 58.5
IQR (CV) : 1.8 (1.2)
937 distinct values 10052 (20.6%)
calculated_host_listings_count [integer]
Mean (sd) : 7.1 (33)
min ≤ med ≤ max:
1 ≤ 1 ≤ 327
IQR (CV) : 1 (4.6)
47 distinct values 0 (0.0%)
availability_365 [integer]
Mean (sd) : 112.8 (131.6)
min ≤ med ≤ max:
0 ≤ 45 ≤ 365
IQR (CV) : 227 (1.2)
366 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.2)
2023-04-23

The AB_NYC_2019 dataset describes listing activities of Airbnb properties in different boroughs of New York City, in New York for 2019. Each row contains the name of the listings, id, name of the host as well as information on rental types, geographical coordinates, prices, reviews and their availability in 2019.

Tidy Data

I want to plot relationships between the date of review and other variables.The variable last_review contains missing values. So I first retain observations that were reviewed. Then, I create month and year columns from the date variable.

After this, I select a subset of variables.

mydata2 <- mydata %>% 
  mutate(Date = ymd(last_review))%>% 
  drop_na(Date)%>% 
  mutate(day = day(Date), month = month(Date, label=TRUE), year = year(Date))

#I first select the required variables.
select_df<-mydata2 %>%
  select(id, neighbourhood_group:year)

I obtain the average number of reviews, and then only retain the listings that have number of reviews that are equal or higher than the average. Then I first group them by month and then by room type and month.

#Mean availability
summary_stats_numberofreviews <-select_df  %>% 
summarise (Mean_availability=mean(number_of_reviews, na.rm = TRUE))
 
#Grouping by month
summary_stats_month2 <-select_df  %>% 
filter (availability_365>0)  %>% 
  filter(price > quantile(price)[2] - 1.5 * IQR(price) & price < quantile(price)[4] + 1.5 * IQR(price))  %>% 
 group_by(year)  %>%
  summarise(
    Mean=mean(price, na.rm = TRUE),
    Quantile1 = quantile(price, c(0.25), q1 = c(0.25), na.rm = TRUE),
    Median=median(price, na.rm = TRUE),
    Quantile3 = quantile(price, c(0.75), q3 = c(0.75), na.rm = TRUE),
    SD=sd(price, na.rm = TRUE),
    min=min(price, na.rm = TRUE),
    max=max(price, na.rm = TRUE),
    )


#Grouping by month and room type
summary_stats_month3 <-select_df  %>% 
filter (availability_365>0)  %>% 
  filter(price > quantile(price)[2] - 1.5 * IQR(price) & price < quantile(price)[4] + 1.5 * IQR(price))  %>% 
group_by(year, room_type)  %>%
  summarise(
    Mean=mean(price, na.rm = TRUE),
    Quantile1 = quantile(price, c(0.25), q1 = c(0.25), na.rm = TRUE),
    Median=median(price, na.rm = TRUE),
    Quantile3 = quantile(price, c(0.75), q3 = c(0.75), na.rm = TRUE),
    SD=sd(price, na.rm = TRUE),
    min=min(price, na.rm = TRUE),
    max=max(price, na.rm = TRUE),
    )

Time dependent visualization:

The graph shows how the price in 2019 varies with the month of review.

ggplot(summary_stats_month2, aes(x = as.integer(year), y = Mean, group=1)) +
  geom_line(color = "#4C72B0") +
  geom_point(size = 3, color = "#4C72B0") +
  labs(title = "Price of Airbnb",
       x = "Last Review Year", y = "Mean price") +
  scale_x_continuous(breaks = seq(min(as.integer(summary_stats_month2$year)), max(as.integer(summary_stats_month2$year)), by = 1),
                     labels = seq(min(as.integer(summary_stats_month2$year)), max(as.integer(summary_stats_month2$year)), by = 1)) +
  theme_minimal()

Time dependent visualization with different categories:

ggplot(summary_stats_month3, aes(x = factor(year), y = Mean, fill = room_type)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
  labs(x = "Year", y = "Mean Price", fill = "Room Type") +
  ggtitle("Summary Statistics by Year and Room Type") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Set2")

We can see that most recently reviewed airbnbs are cheaper compared to the ones that were last reviewed much before. One interesting pattern is the introduction of shared rooms that are much cheaper.

Continuing to play around

Next, I look at average availability for 2019, but I also take the type of listings into consideration:

select_df_type <- select_df %>% filter(year>=2019)%>% group_by(month, room_type) %>% summarise(count=n(), mean_availability=mean(availability_365)) %>%  ungroup()

ggplot(select_df_type, aes(x = month, y = mean_availability, color = room_type, group = room_type)) +
  geom_line() +
  geom_point() +
  labs(title = "Monthly Average Availability by Room Type",
       x = "Month",
       y = "Average Availability (in days)",
       color = "Room Type") +
  scale_color_manual(values = c("#F8766D", "#00BFC4", "#C77CFF")) +
  theme_minimal()

ggplot(select_df_type, aes(x = month, y = count, fill = room_type)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Airbnbs in 2019: Monthly Room Type Counts",
       x = "Month",
       y = "Count",
       fill = "Room Type") +
  scale_fill_manual(values = c("#F8766D", "#00BFC4", "#C77CFF")) +
  theme_minimal()

Looking at availability by neighbourhood:

select_df_type2 <- select_df %>% filter(year>=2019)%>% group_by(month, neighbourhood_group ) %>% summarise(count=n(), mean_availability=mean(availability_365)) %>%  ungroup()

ggplot(select_df_type2, aes(x = month, y = count, fill = neighbourhood_group)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Airbnbs in 2019: Monthly count by neighbourhood",
       x = "Month",
       y = "Count",
       fill = "Room Type") +
  scale_fill_manual(values = c("#F8766D", "#00BFC4", "#C77CFF", "#FFA500", "#FF69B4")) +
  theme_minimal()