Challenge 7 Erika Nagai

challenge_7

air_bnb

Visualizing Multiple Dimensions

Author

Erika Nagai

Published

October 31, 2022

library(tidyverse)
library(ggplot2)
library(summarytools)
library(knitr)
library(wordcloud)
library(ggwordcloud)

Error in library(ggwordcloud): there is no package called 'ggwordcloud'

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
mutate variables as needed (including sanity checks)
Recreate at least two graphs from previous exercises, but introduce at least one additional dimension that you omitted before using ggplot functionality (color, shape, line, facet, etc) The goal is not to create unneeded chart ink (Tufte), but to concisely capture variation in additional dimensions that were collapsed in your earlier 2 or 3 dimensional graphs.

Explain why you choose the specific graph type

If you haven’t tried in previous weeks, work this week to make your graphs “publication” ready with titles, captions, and pretty axis labels and other viewer-friendly features

R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code. And anyone not familiar with Edward Tufte should check out his fantastic books and courses on data visualizaton.

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

eggs ⭐
abc_poll ⭐⭐
australian_marriage ⭐⭐
hotel_bookings ⭐⭐⭐
air_bnb ⭐⭐⭐
us_hh ⭐⭐⭐⭐
faostat ⭐⭐⭐⭐⭐

Using read_csv function, I read in AB_NYC_2019.csv as “ab_df”

ab_df = read_csv("_data/AB_NYC_2019.csv")

Briefly describe the data

I explored this dataset to understand how it is structured and what kind of data is included

This dataset consists of 16 columns (variables) and 48895 rows. It includes the follwoing variables.

str(ab_df)

spc_tbl_ [48,895 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ id                            : num [1:48895] 2539 2595 3647 3831 5022 ...
 $ name                          : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
 $ host_id                       : num [1:48895] 2787 2845 4632 4869 7192 ...
 $ host_name                     : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
 $ neighbourhood_group           : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
 $ neighbourhood                 : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
 $ latitude                      : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
 $ longitude                     : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
 $ room_type                     : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
 $ price                         : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
 $ minimum_nights                : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
 $ number_of_reviews             : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
 $ last_review                   : Date[1:48895], format: "2018-10-19" "2019-05-21" ...
 $ reviews_per_month             : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
 $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
 $ availability_365              : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
 - attr(*, "spec")=
  .. cols(
  ..   id = col_double(),
  ..   name = col_character(),
  ..   host_id = col_double(),
  ..   host_name = col_character(),
  ..   neighbourhood_group = col_character(),
  ..   neighbourhood = col_character(),
  ..   latitude = col_double(),
  ..   longitude = col_double(),
  ..   room_type = col_character(),
  ..   price = col_double(),
  ..   minimum_nights = col_double(),
  ..   number_of_reviews = col_double(),
  ..   last_review = col_date(format = ""),
  ..   reviews_per_month = col_double(),
  ..   calculated_host_listings_count = col_double(),
  ..   availability_365 = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

print(summarytools::dfSummary(ab_df,
                         varnumbers = FALSE,
                         plain.ascii  = FALSE,
                         style        = "grid",
                         graph.magnif = 0.80,
                        valid.col    = FALSE),
       method = 'render',
       table.classes = 'table-condensed')

Data Frame Summary

ab_df

Dimensions: 48895 x 16
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

id [numeric]

Mean (sd) : 19017143 (10983108)

min ≤ med ≤ max:

2539 ≤ 19677284 ≤ 36487245

IQR (CV) : 19680234 (0.6)

48895 distinct values

0 (0.0%)

name [character]

1. Hillside Hotel

2. Home away from home

3. New york Multi-unit build

4. Brooklyn Apartment

5. Loft Suite @ The Box Hous

6. Private Room

7. Artsy Private BR in Fort

8. Private room

9. Beautiful Brooklyn Browns

10. Cozy Brooklyn Apartment

[ 47884 others ]

18	(	0.0%	)
17	(	0.0%	)
16	(	0.0%	)
12	(	0.0%	)
11	(	0.0%	)
11	(	0.0%	)
10	(	0.0%	)
10	(	0.0%	)
8	(	0.0%	)
8	(	0.0%	)
48758	(	99.8%	)

16 (0.0%)

host_id [numeric]

Mean (sd) : 67620011 (78610967)

min ≤ med ≤ max:

2438 ≤ 30793816 ≤ 274321313

IQR (CV) : 99612390 (1.2)

37457 distinct values

0 (0.0%)

host_name [character]

1. Michael

2. David

3. Sonder (NYC)

4. John

5. Alex

6. Blueground

7. Sarah

8. Daniel

9. Jessica

10. Maria

[ 11442 others ]

417	(	0.9%	)
403	(	0.8%	)
327	(	0.7%	)
294	(	0.6%	)
279	(	0.6%	)
232	(	0.5%	)
227	(	0.5%	)
226	(	0.5%	)
205	(	0.4%	)
204	(	0.4%	)
46060	(	94.2%	)

21 (0.0%)

neighbourhood_group [character]

1. Bronx

2. Brooklyn

3. Manhattan

4. Queens

5. Staten Island

1091	(	2.2%	)
20104	(	41.1%	)
21661	(	44.3%	)
5666	(	11.6%	)
373	(	0.8%	)

0 (0.0%)

neighbourhood [character]

1. Williamsburg

2. Bedford-Stuyvesant

3. Harlem

4. Bushwick

5. Upper West Side

6. Hell's Kitchen

7. East Village

8. Upper East Side

9. Crown Heights

10. Midtown

[ 211 others ]

3920	(	8.0%	)
3714	(	7.6%	)
2658	(	5.4%	)
2465	(	5.0%	)
1971	(	4.0%	)
1958	(	4.0%	)
1853	(	3.8%	)
1798	(	3.7%	)
1564	(	3.2%	)
1545	(	3.2%	)
25449	(	52.0%	)

0 (0.0%)

latitude [numeric]

Mean (sd) : 40.7 (0.1)

min ≤ med ≤ max:

40.5 ≤ 40.7 ≤ 40.9

IQR (CV) : 0.1 (0)

19048 distinct values

0 (0.0%)

longitude [numeric]

Mean (sd) : -74 (0)

min ≤ med ≤ max:

-74.2 ≤ -74 ≤ -73.7

IQR (CV) : 0 (0)

14718 distinct values

0 (0.0%)

room_type [character]

1. Entire home/apt

2. Private room

3. Shared room

25409	(	52.0%	)
22326	(	45.7%	)
1160	(	2.4%	)

0 (0.0%)

price [numeric]

Mean (sd) : 152.7 (240.2)

min ≤ med ≤ max:

0 ≤ 106 ≤ 10000

IQR (CV) : 106 (1.6)

674 distinct values

0 (0.0%)

minimum_nights [numeric]

Mean (sd) : 7 (20.5)

min ≤ med ≤ max:

1 ≤ 3 ≤ 1250

IQR (CV) : 4 (2.9)

109 distinct values

0 (0.0%)

number_of_reviews [numeric]

Mean (sd) : 23.3 (44.6)

min ≤ med ≤ max:

0 ≤ 5 ≤ 629

IQR (CV) : 23 (1.9)

394 distinct values

0 (0.0%)

last_review [Date]

min : 2011-03-28

med : 2019-05-19

max : 2019-07-08

range : 8y 3m 10d

1764 distinct values

10052 (20.6%)

reviews_per_month [numeric]

Mean (sd) : 1.4 (1.7)

min ≤ med ≤ max:

0 ≤ 0.7 ≤ 58.5

IQR (CV) : 1.8 (1.2)

937 distinct values

10052 (20.6%)

calculated_host_listings_count [numeric]

Mean (sd) : 7.1 (33)

min ≤ med ≤ max:

1 ≤ 1 ≤ 327

IQR (CV) : 1 (4.6)

47 distinct values

0 (0.0%)

availability_365 [numeric]

Mean (sd) : 112.8 (131.6)

min ≤ med ≤ max:

0 ≤ 45 ≤ 365

IQR (CV) : 227 (1.2)

366 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

AB_NYC_2019 documents the information on the accommodation in New York (as you can figure out from the name of the original csv file and the average longitude and latitude) that is registered on Air BnB as of sometime in 2019. The information includes 1) id and name of the accommodation 2) id and name of the host 3) geographic information of the accommodation (neighbourhood_group, latitude, longitude) 4) reservation-related information of the accommodation (room type, price, minimum night per stay) 5) information of the reviews (date of last review, total number of review, average number of review per month) 6) days available of the accommodation per year

The accommodations without any review have NA (missing value) in last_review and reviews_per_month as we can see that the number of missing values in last_review and reviews_per_month (10052) matches the number of accommodations whose number_of_review is 0

ab_df %>% filter(`number_of_reviews` == 0) %>%
  count()

# A tibble: 1 × 1
      n
  <int>
1 10052

This dataset is quite tidy because one row means one observation (one accommodation), however it can be separated into host_df dataframe and accommodation_df dataframe to make this dataset cleaner and easier to use.

###Accommodation_df

I created accommodation_df as follows removing host_name, calculated_host_listings_count columns. (* We should NOT remove host_id because it will be necessary if we want to join accomodation_df and host_df)

accommodation_df <- ab_df %>%
  select(-c(host_name, calculated_host_listings_count))

head(accommodation_df)

# A tibble: 6 × 14
     id name       host_id neigh…¹ neigh…² latit…³ longi…⁴ room_…⁵ price minim…⁶
  <dbl> <chr>        <dbl> <chr>   <chr>     <dbl>   <dbl> <chr>   <dbl>   <dbl>
1  2539 Clean & q…    2787 Brookl… Kensin…    40.6   -74.0 Privat…   149       1
2  2595 Skylit Mi…    2845 Manhat… Midtown    40.8   -74.0 Entire…   225       1
3  3647 THE VILLA…    4632 Manhat… Harlem     40.8   -73.9 Privat…   150       3
4  3831 Cozy Enti…    4869 Brookl… Clinto…    40.7   -74.0 Entire…    89       1
5  5022 Entire Ap…    7192 Manhat… East H…    40.8   -73.9 Entire…    80      10
6  5099 Large Coz…    7322 Manhat… Murray…    40.7   -74.0 Entire…   200       3
# … with 4 more variables: number_of_reviews <dbl>, last_review <date>,
#   reviews_per_month <dbl>, availability_365 <dbl>, and abbreviated variable
#   names ¹neighbourhood_group, ²neighbourhood, ³latitude, ⁴longitude,
#   ⁵room_type, ⁶minimum_nights

I realized that there are 17,541 observations with 0 as price and/or availability_365, which doesn’t make sense for an accommodation listing on AirBnB.

summary(accommodation_df)

       id               name              host_id          neighbourhood_group
 Min.   :    2539   Length:48895       Min.   :     2438   Length:48895       
 1st Qu.: 9471945   Class :character   1st Qu.:  7822033   Class :character   
 Median :19677284   Mode  :character   Median : 30793816   Mode  :character   
 Mean   :19017143                      Mean   : 67620011                      
 3rd Qu.:29152178                      3rd Qu.:107434423                      
 Max.   :36487245                      Max.   :274321313                      
                                                                              
 neighbourhood         latitude       longitude       room_type        
 Length:48895       Min.   :40.50   Min.   :-74.24   Length:48895      
 Class :character   1st Qu.:40.69   1st Qu.:-73.98   Class :character  
 Mode  :character   Median :40.72   Median :-73.96   Mode  :character  
                    Mean   :40.73   Mean   :-73.95                     
                    3rd Qu.:40.76   3rd Qu.:-73.94                     
                    Max.   :40.91   Max.   :-73.71                     
                                                                       
     price         minimum_nights    number_of_reviews  last_review        
 Min.   :    0.0   Min.   :   1.00   Min.   :  0.00    Min.   :2011-03-28  
 1st Qu.:   69.0   1st Qu.:   1.00   1st Qu.:  1.00    1st Qu.:2018-07-08  
 Median :  106.0   Median :   3.00   Median :  5.00    Median :2019-05-19  
 Mean   :  152.7   Mean   :   7.03   Mean   : 23.27    Mean   :2018-10-04  
 3rd Qu.:  175.0   3rd Qu.:   5.00   3rd Qu.: 24.00    3rd Qu.:2019-06-23  
 Max.   :10000.0   Max.   :1250.00   Max.   :629.00    Max.   :2019-07-08  
                                                       NA's   :10052       
 reviews_per_month availability_365
 Min.   : 0.010    Min.   :  0.0   
 1st Qu.: 0.190    1st Qu.:  0.0   
 Median : 0.720    Median : 45.0   
 Mean   : 1.373    Mean   :112.8   
 3rd Qu.: 2.020    3rd Qu.:227.0   
 Max.   :58.500    Max.   :365.0   
 NA's   :10052

accommodation_df %>% 
  select(c("name", "price", "availability_365")) %>%
  filter(price == 0 | availability_365 == 0)

# A tibble: 17,541 × 3
   name                                               price availability_365
   <chr>                                              <dbl>            <dbl>
 1 Entire Apt: Spacious Studio/Loft by central park      80                0
 2 BlissArtsSpace!                                       60                0
 3 Cozy Clean Guest Room - Family Apt                    79                0
 4 West Village Nest - Superhost                        120                0
 5 Sweet and Spacious Brooklyn Loft                     299                0
 6 Magnifique Suite au N de Manhattan - vue Cloitres     80                0
 7 bright and stylish duplex                            115                0
 8 Light-filled 2B duplex in the heart of Park Slope!   225                0
 9 Great Location for NYC                                50                0
10 Charming 1 bed GR8 WBurg LOCATION!                   100                0
# … with 17,531 more rows

These listings should be considered unavailable or at least not ready for receiving the travelers, so I created a new column outlier (1 = outlier, 0 = no outlier).

accommodation_df <- accommodation_df %>%
  mutate(outlier = case_when(.$price == 0|.$availability_365 == 0 ~ 1, 
                                  TRUE ~ 0))

Looking at minimum_nights, I noticed that some accommodations require a lot of days for the minimum nights.

I learned from the below table that 1) The minimum nights of 85% + listings are 7 nights or less. 2) The minimum nights of 98% + listings are 31 nights or less. 3) There are small peaks at 30 nights (3760), 60 nights (106), 90 nights (104), and 120 nights (28), inferring that some listings require 1,2,3, or 4 month stay.

accommodation_df %>%
  group_by(minimum_nights) %>%
  dplyr::summarize(
    frequency = n(),
  ) %>%
  mutate(
    cumulative_percentage = cumsum(frequency)/sum(frequency)*100
  ) %>%
  kable()

minimum_nights	frequency	cumulative_percentage
1	12720	26.01493
2	11696	49.93558
3	7999	66.29512
4	3303	73.05041
5	3034	79.25555
6	752	80.79354
7	2058	85.00256
8	130	85.26843
9	80	85.43205
10	483	86.41988
11	33	86.48737
12	91	86.67348
13	54	86.78392
14	562	87.93333
15	279	88.50394
16	18	88.54075
17	14	88.56938
18	28	88.62665
19	8	88.64301
20	223	89.09909
21	135	89.37519
22	8	89.39155
23	8	89.40791
24	14	89.43655
25	82	89.60425
26	16	89.63698
27	17	89.67175
28	203	90.08692
29	340	90.78229
30	3760	98.47224
31	201	98.88332
32	8	98.89968
33	2	98.90377
35	15	98.93445
36	1	98.93650
37	1	98.93854
39	1	98.94059
40	13	98.96717
42	1	98.96922
43	1	98.97127
44	2	98.97536
45	32	99.04080
47	2	99.04489
50	14	99.07352
53	2	99.07762
55	4	99.08580
56	2	99.08989
58	1	99.09193
59	6	99.10420
60	106	99.32099
62	3	99.32713
65	1	99.32917
68	1	99.33122
70	8	99.34758
74	1	99.34963
75	4	99.35781
80	9	99.37621
85	1	99.37826
87	1	99.38030
88	2	99.38440
90	104	99.59710
91	3	99.60323
93	2	99.60732
98	1	99.60937
99	2	99.61346
100	15	99.64414
105	2	99.64823
110	3	99.65436
114	1	99.65641
115	2	99.66050
120	28	99.71776
122	1	99.71981
133	2	99.72390
134	1	99.72594
150	6	99.73821
153	1	99.74026
160	2	99.74435
175	1	99.74640
180	43	99.83434
181	2	99.83843
182	1	99.84047
183	1	99.84252
184	1	99.84456
185	1	99.84661
186	1	99.84866
198	1	99.85070
200	4	99.85888
210	2	99.86297
222	1	99.86502
225	1	99.86706
240	3	99.87320
250	1	99.87524
265	1	99.87729
270	2	99.88138
275	1	99.88342
299	1	99.88547
300	6	99.89774
354	1	99.89979
360	5	99.91001
364	1	99.91206
365	29	99.97137
366	1	99.97341
370	1	99.97546
400	1	99.97750
480	1	99.97955
500	5	99.98977
999	3	99.99591
1000	1	99.99795
1250	1	100.00000

Thus, it is fairly safe to consider that the accommodation that require more than 31 days (1 month) for the minimum nights as outliers. Since these accommodations are not available for short-term travelers, I consider them as “outlier” (outlier = 1)

accommodation_df <- accommodation_df %>%
  mutate(outlier = case_when(.$price == 0|.$availability_365 == 0| minimum_nights > 31 ~ 1, 
                                  TRUE ~ 0))

accommodation_df

# A tibble: 48,895 × 15
      id name      host_id neigh…¹ neigh…² latit…³ longi…⁴ room_…⁵ price minim…⁶
   <dbl> <chr>       <dbl> <chr>   <chr>     <dbl>   <dbl> <chr>   <dbl>   <dbl>
 1  2539 Clean & …    2787 Brookl… Kensin…    40.6   -74.0 Privat…   149       1
 2  2595 Skylit M…    2845 Manhat… Midtown    40.8   -74.0 Entire…   225       1
 3  3647 THE VILL…    4632 Manhat… Harlem     40.8   -73.9 Privat…   150       3
 4  3831 Cozy Ent…    4869 Brookl… Clinto…    40.7   -74.0 Entire…    89       1
 5  5022 Entire A…    7192 Manhat… East H…    40.8   -73.9 Entire…    80      10
 6  5099 Large Co…    7322 Manhat… Murray…    40.7   -74.0 Entire…   200       3
 7  5121 BlissArt…    7356 Brookl… Bedfor…    40.7   -74.0 Privat…    60      45
 8  5178 Large Fu…    8967 Manhat… Hell's…    40.8   -74.0 Privat…    79       2
 9  5203 Cozy Cle…    7490 Manhat… Upper …    40.8   -74.0 Privat…    79       2
10  5238 Cute & C…    7549 Manhat… Chinat…    40.7   -74.0 Entire…   150       1
# … with 48,885 more rows, 5 more variables: number_of_reviews <dbl>,
#   last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>,
#   outlier <dbl>, and abbreviated variable names ¹neighbourhood_group,
#   ²neighbourhood, ³latitude, ⁴longitude, ⁵room_type, ⁶minimum_nights

Some accommodations cost more than 1000 USD per night, which is extremely expensive. Looking at the below table, only less than 0.5% of accommodations charge 1000+ USD per night so I decided to consider those accommodations as outliers.

accommodation_df %>%
  group_by(price) %>%
  dplyr::summarize(
    frequency = n(),
  ) %>%
  mutate(
    cumulative_percentage = cumsum(frequency)/sum(frequency)*100
  ) %>%
  kable()

price	frequency	cumulative_percentage
0	11	0.0224972
10	17	0.0572656
11	3	0.0634012
12	4	0.0715820
13	1	0.0736272
15	6	0.0858984
16	6	0.0981695
18	2	0.1022599
19	4	0.1104407
20	33	0.1779323
21	6	0.1902035
22	16	0.2229267
23	7	0.2372431
24	13	0.2638307
25	100	0.4683505
26	24	0.5174353
27	29	0.5767461
28	55	0.6892320
29	67	0.8262604
30	255	1.3477861
31	46	1.4418652
32	82	1.6095715
33	93	1.7997750
34	84	1.9715717
35	480	2.9532672
36	103	3.1639227
37	132	3.4338889
38	167	3.7754372
39	196	4.1762961
40	771	5.7531445
41	86	5.9290316
42	192	6.3217098
43	113	6.5528173
44	119	6.7961959
45	891	8.6184681
46	102	8.8270784
47	163	9.1604459
48	203	9.5756212
49	345	10.2812148
50	1534	13.4185500
51	96	13.6148890
52	131	13.8828101
53	118	14.1241436
54	112	14.3532058
55	950	16.2961448
56	74	16.4474895
57	109	16.6704162
58	157	16.9915124
59	297	17.5989365
60	1458	20.5808365
61	54	20.6912772
62	140	20.9776051
63	108	21.1984866
64	90	21.3825545
65	1190	23.8163411
66	87	23.9942734
67	107	24.2131097
68	160	24.5403415
69	372	25.3011555
70	1170	27.6940382
71	60	27.8167502
72	124	28.0703548
73	84	28.2421515
74	68	28.3812251
75	1370	31.1831476
76	68	31.3222211
77	90	31.5062890
78	122	31.7558033
79	299	32.3673177
80	1272	34.9688107
81	51	35.0731159
82	71	35.2183250
83	63	35.3471725
84	75	35.5005624
85	877	37.2942019
86	79	37.4557726
87	83	37.6255241
88	115	37.8607220
89	365	38.6072196
90	1021	40.6953676
91	52	40.8017180
92	61	40.9264751
93	64	41.0573678
94	68	41.1964414
95	700	42.6280806
96	64	42.7589733
97	91	42.9450864
98	137	43.2252787
99	742	44.7428162
100	2051	48.9375192
101	46	49.0315983
102	45	49.1236323
103	49	49.2238470
104	44	49.3138358
105	325	49.9785254
106	35	50.0501074
107	42	50.1360057
108	54	50.2464465
109	152	50.5573167
110	739	52.0687187
111	38	52.1464362
112	53	52.2548318
113	42	52.3407301
114	28	52.3979957
115	468	53.3551488
116	34	53.4246856
117	53	53.5330811
118	57	53.6496574
119	191	54.0402904
120	1130	56.3513652
121	23	56.3984047
122	41	56.4822579
123	44	56.5722467
124	47	56.6683710
125	1057	58.8301462
126	33	58.8976378
127	27	58.9528582
128	60	59.0755701
129	207	59.4989263
130	610	60.7464976
131	24	60.7955824
132	36	60.8692095
133	53	60.9776051
134	36	61.0512322
135	415	61.8999898
136	24	61.9490745
137	35	62.0206565
138	38	62.0983741
139	170	62.4460579
140	548	63.5668269
141	11	63.5893241
142	55	63.7018100
143	24	63.7508948
144	31	63.8142959
145	323	64.4748952
146	25	64.5260252
147	32	64.5914715
148	50	64.6937315
149	278	65.2622968
150	2047	69.4488189
151	35	69.5204009
152	24	69.5694856
153	21	69.6124348
154	34	69.6819716
155	183	70.0562430
156	21	70.0991921
157	20	70.1400961
158	37	70.2157685
159	164	70.5511811
160	480	71.5328766
161	15	71.5635546
162	22	71.6085489
163	22	71.6535433
164	23	71.7005829
165	315	72.3448205
166	20	72.3857245
167	23	72.4327641
168	34	72.5023008
169	123	72.7538603
170	295	73.3571940
171	12	73.3817364
172	28	73.4390019
173	16	73.4717251
174	31	73.5351263
175	763	75.0956130
176	19	75.1344718
177	30	75.1958278
178	23	75.2428674
179	143	75.5353308
180	522	76.6029246
181	14	76.6315574
182	21	76.6745066
183	9	76.6929134
184	24	76.7419982
185	236	77.2246651
186	18	77.2614787
187	24	77.3105635
188	30	77.3719194
189	123	77.6234789
190	292	78.2206770
191	13	78.2472645
192	10	78.2677165
193	5	78.2779425
194	15	78.3086205
195	293	78.9078638
196	27	78.9630842
197	28	79.0203497
198	58	79.1389713
199	415	79.9877288
200	1401	82.8530525
201	8	82.8694141
202	15	82.9000920
203	9	82.9184988
204	7	82.9328152
205	50	83.0350752
206	15	83.0657531
207	12	83.0902955
208	13	83.1168831
209	46	83.2109623
210	133	83.4829737
211	12	83.5075161
212	14	83.5361489
213	3	83.5422845
214	14	83.5709173
215	111	83.7979343
216	4	83.8061151
217	12	83.8306575
218	12	83.8551999
219	56	83.9697311
220	254	84.4892116
221	8	84.5055732
222	14	84.5342060
223	6	84.5464771
224	13	84.5730647
225	451	85.4954494
226	12	85.5199918
227	14	85.5486246
228	23	85.5956642
229	64	85.7265569
230	109	85.9494836
231	6	85.9617548
232	10	85.9822068
233	11	86.0047040
234	12	86.0292463
235	66	86.1642295
236	10	86.1846815
237	6	86.1969527
238	6	86.2092238
239	106	86.4260149
240	114	86.6591676
241	9	86.6775744
242	4	86.6857552
243	10	86.7062072
244	15	86.7368852
245	93	86.9270887
246	7	86.9414051
247	11	86.9639022
248	15	86.9945802
249	133	87.2665917
250	1018	89.3486042
251	5	89.3588301
252	7	89.3731465
253	3	89.3792821
254	7	89.3935985
255	42	89.4794969
256	8	89.4958585
257	5	89.5060845
258	5	89.5163105
259	30	89.5776664
260	81	89.7433275
261	5	89.7535535
262	8	89.7699151
263	5	89.7801411
264	12	89.8046835
265	58	89.9233050
266	1	89.9253502
267	9	89.9437570
268	6	89.9560282
269	33	90.0235198
270	42	90.1094181
271	10	90.1298701
272	4	90.1380509
273	5	90.1482769
274	9	90.1666837
275	244	90.6657122
276	8	90.6820738
277	3	90.6882094
278	15	90.7188874
279	23	90.7659270
280	91	90.9520401
281	2	90.9561305
282	5	90.9663565
283	5	90.9765825
284	2	90.9806729
285	65	91.1136108
287	7	91.1279272
288	12	91.1524696
289	38	91.2301871
290	63	91.3590347
291	3	91.3651703
292	1	91.3672155
293	3	91.3733511
294	6	91.3856223
295	114	91.6187749
296	1	91.6208201
297	5	91.6310461
298	13	91.6576337
299	161	91.9869107
300	561	93.1342673
301	1	93.1363125
302	7	93.1506289
303	7	93.1649453
304	3	93.1710809
305	23	93.2181205
306	6	93.2303917
307	5	93.2406177
308	7	93.2549340
309	7	93.2692504
310	21	93.3121996
311	2	93.3162900
312	9	93.3346968
313	1	93.3367420
314	9	93.3551488
315	19	93.3940076
316	13	93.4205952
317	4	93.4287759
318	1	93.4308211
319	8	93.4471827
320	39	93.5269455
321	6	93.5392167
322	1	93.5412619
323	2	93.5453523
324	2	93.5494427
325	120	93.7948666
327	1	93.7969117
328	3	93.8030473
329	13	93.8296349
330	27	93.8848553
331	1	93.8869005
332	3	93.8930361
333	9	93.9114429
334	5	93.9216689
335	8	93.9380305
336	1	93.9400757
337	2	93.9441661
338	1	93.9462113
339	16	93.9789345
340	18	94.0157480
341	4	94.0239288
342	1	94.0259740
343	4	94.0341548
344	1	94.0362000
345	25	94.0873300
346	2	94.0914204
347	4	94.0996012
348	3	94.1057368
349	45	94.1977707
350	381	94.9769915
352	4	94.9851723
353	1	94.9872175
355	10	95.0076695
356	2	95.0117599
358	7	95.0260763
359	13	95.0526639
360	21	95.0956130
361	1	95.0976582
362	1	95.0997034
365	18	95.1365170
367	2	95.1406074
369	19	95.1794662
370	13	95.2060538
372	2	95.2101442
373	1	95.2121894
374	6	95.2244606
375	72	95.3717149
376	3	95.3778505
377	5	95.3880765
378	3	95.3942121
379	16	95.4269353
380	36	95.5005624
381	1	95.5026076
382	3	95.5087432
383	1	95.5107884
385	14	95.5394212
386	1	95.5414664
387	2	95.5455568
388	3	95.5516924
389	10	95.5721444
390	24	95.6212292
392	4	95.6294100
393	2	95.6335004
394	2	95.6375908
395	41	95.7214439
396	6	95.7337151
397	1	95.7357603
398	8	95.7521219
399	89	95.9341446
400	225	96.3943143
402	1	96.3963595
403	2	96.4004499
404	2	96.4045403
407	2	96.4086307
408	3	96.4147663
409	5	96.4249923
410	6	96.4372635
412	5	96.4474895
414	2	96.4515799
415	5	96.4618059
416	4	96.4699867
417	1	96.4720319
418	2	96.4761223
419	5	96.4863483
420	13	96.5129359
421	2	96.5170263
422	1	96.5190715
423	1	96.5211167
424	1	96.5231619
425	50	96.6254218
426	1	96.6274670
428	1	96.6295122
429	5	96.6397382
430	6	96.6520094
431	1	96.6540546
432	2	96.6581450
433	2	96.6622354
434	1	96.6642806
435	3	96.6704162
436	1	96.6724614
438	2	96.6765518
439	2	96.6806422
440	4	96.6888230
442	1	96.6908682
443	3	96.6970038
444	2	96.7010942
445	2	96.7051846
446	1	96.7072298
447	2	96.7113202
449	12	96.7358626
450	156	97.0549136
451	2	97.0590040
452	1	97.0610492
453	1	97.0630944
454	1	97.0651396
455	4	97.0733204
459	5	97.0835464
460	7	97.0978628
462	1	97.0999080
463	1	97.1019532
465	7	97.1162696
466	2	97.1203600
468	2	97.1244504
469	1	97.1264956
470	4	97.1346763
471	1	97.1367215
472	3	97.1428571
474	1	97.1449023
475	23	97.1919419
477	1	97.1939871
478	2	97.1980775
479	5	97.2083035
480	8	97.2246651
481	1	97.2267103
483	1	97.2287555
484	2	97.2328459
485	7	97.2471623
487	2	97.2512527
488	2	97.2553431
489	7	97.2696595
490	7	97.2839759
495	25	97.3351058
496	1	97.3371510
497	1	97.3391962
498	1	97.3412414
499	65	97.4741794
500	191	97.8648124
503	2	97.8689028
505	3	97.8750383
506	1	97.8770835
509	1	97.8791287
510	2	97.8832191
515	3	97.8893547
519	1	97.8913999
520	5	97.9016259
524	1	97.9036711
525	15	97.9343491
530	2	97.9384395
531	1	97.9404847
534	7	97.9548011
539	1	97.9568463
540	3	97.9629819
545	5	97.9732079
549	5	97.9834339
550	51	98.0877390
555	1	98.0897842
559	1	98.0918294
560	5	98.1020554
561	1	98.1041006
573	1	98.1061458
575	12	98.1306882
577	2	98.1347786
578	1	98.1368238
580	1	98.1388690
583	2	98.1429594
584	1	98.1450046
585	5	98.1552306
587	1	98.1572758
589	2	98.1613662
590	8	98.1777278
595	13	98.2043154
599	19	98.2431741
600	81	98.4088353
612	1	98.4108805
615	2	98.4149709
616	1	98.4170161
619	1	98.4190613
620	2	98.4231517
625	4	98.4313324
639	1	98.4333776
640	2	98.4374680
643	1	98.4395132
647	1	98.4415584
649	7	98.4558748
650	55	98.5683608
655	1	98.5704060
659	1	98.5724512
670	2	98.5765416
672	2	98.5806320
675	13	98.6072196
680	1	98.6092648
688	1	98.6113099
689	1	98.6133551
690	4	98.6215359
692	1	98.6235811
695	2	98.6276715
699	35	98.6992535
700	47	98.7953779
714	12	98.8199202
718	1	98.8219654
720	1	98.8240106
725	1	98.8260558
737	10	98.8465078
745	2	98.8505982
748	7	98.8649146
749	2	98.8690050
750	45	98.9610390
760	2	98.9651294
766	1	98.9671746
780	1	98.9692198
785	2	98.9733102
789	1	98.9753554
790	2	98.9794458
794	1	98.9814909
795	6	98.9937621
799	18	99.0305757
800	54	99.1410165
805	1	99.1430617
814	1	99.1451069
820	1	99.1471521
822	1	99.1491973
825	2	99.1532877
830	1	99.1553329
840	2	99.1594233
843	1	99.1614685
848	1	99.1635137
850	19	99.2023724
860	2	99.2064628
875	1	99.2085080
888	1	99.2105532
890	3	99.2166888
894	1	99.2187340
895	3	99.2248696
898	1	99.2269148
899	5	99.2371408
900	20	99.2780448
913	1	99.2800900
920	1	99.2821352
930	1	99.2841804
935	1	99.2862256
945	3	99.2923612
949	2	99.2964516
950	11	99.3189488
956	3	99.3250844
974	1	99.3271296
975	4	99.3353104
980	2	99.3394008
985	1	99.3414460
990	1	99.3434912
995	6	99.3557623
999	17	99.3905307
1000	59	99.5111975
1002	1	99.5132427
1020	1	99.5152879
1046	1	99.5173331
1050	3	99.5234687
1066	1	99.5255139
1067	1	99.5275591
1075	1	99.5296043
1080	1	99.5316495
1095	1	99.5336947
1099	1	99.5357399
1100	12	99.5602822
1115	1	99.5623274
1145	1	99.5643726
1150	2	99.5684630
1170	1	99.5705082
1177	1	99.5725534
1195	2	99.5766438
1200	15	99.6073218
1250	6	99.6195930
1295	1	99.6216382
1299	1	99.6236834
1300	3	99.6298190
1306	1	99.6318642
1308	1	99.6339094
1315	1	99.6359546
1333	1	99.6379998
1350	2	99.6420902
1369	1	99.6441354
1395	1	99.6461806
1400	2	99.6502710
1450	1	99.6523162
1475	1	99.6543614
1485	1	99.6564066
1494	1	99.6584518
1495	1	99.6604970
1497	1	99.6625422
1500	26	99.7157174
1550	1	99.7177626
1599	2	99.7218530
1600	4	99.7300337
1680	1	99.7320789
1700	4	99.7402597
1731	1	99.7423049
1749	1	99.7443501
1750	3	99.7504857
1763	1	99.7525309
1795	1	99.7545761
1799	1	99.7566213
1800	2	99.7607117
1880	1	99.7627569
1899	1	99.7648021
1900	1	99.7668473
1978	1	99.7688925
1999	3	99.7750281
2000	24	99.8241129
2010	1	99.8261581
2100	1	99.8282033
2103	1	99.8302485
2200	1	99.8322937
2250	2	99.8363841
2300	1	99.8384293
2350	1	99.8404745
2400	1	99.8425197
2500	11	99.8650169
2545	1	99.8670621
2590	1	99.8691073
2600	1	99.8711525
2626	1	99.8731977
2695	1	99.8752429
2750	1	99.8772881
2800	1	99.8793333
2850	1	99.8813785
2900	1	99.8834237
2990	1	99.8854689
2995	1	99.8875141
2999	2	99.8916045
3000	8	99.9079660
3200	1	99.9100112
3210	1	99.9120564
3512	1	99.9141016
3518	1	99.9161468
3600	1	99.9181920
3613	1	99.9202372
3750	2	99.9243276
3800	1	99.9263728
3900	1	99.9284180
4000	3	99.9345536
4100	1	99.9365988
4160	1	99.9386440
4200	2	99.9427344
4500	2	99.9468248
5000	6	99.9590960
5100	1	99.9611412
5250	1	99.9631864
6000	2	99.9672768
6419	1	99.9693220
6500	3	99.9754576
6800	1	99.9775028
7500	2	99.9815932
7703	1	99.9836384
8000	1	99.9856836
8500	1	99.9877288
9999	3	99.9938644
10000	3	100.0000000

accommodation_df <- accommodation_df %>%
  mutate(outlier = case_when(.$price == 0|.$price > 1000|.$availability_365 == 0| minimum_nights > 31 ~ 1, 
                                  TRUE ~ 0))

accommodation_df

# A tibble: 48,895 × 15
      id name      host_id neigh…¹ neigh…² latit…³ longi…⁴ room_…⁵ price minim…⁶
   <dbl> <chr>       <dbl> <chr>   <chr>     <dbl>   <dbl> <chr>   <dbl>   <dbl>
 1  2539 Clean & …    2787 Brookl… Kensin…    40.6   -74.0 Privat…   149       1
 2  2595 Skylit M…    2845 Manhat… Midtown    40.8   -74.0 Entire…   225       1
 3  3647 THE VILL…    4632 Manhat… Harlem     40.8   -73.9 Privat…   150       3
 4  3831 Cozy Ent…    4869 Brookl… Clinto…    40.7   -74.0 Entire…    89       1
 5  5022 Entire A…    7192 Manhat… East H…    40.8   -73.9 Entire…    80      10
 6  5099 Large Co…    7322 Manhat… Murray…    40.7   -74.0 Entire…   200       3
 7  5121 BlissArt…    7356 Brookl… Bedfor…    40.7   -74.0 Privat…    60      45
 8  5178 Large Fu…    8967 Manhat… Hell's…    40.8   -74.0 Privat…    79       2
 9  5203 Cozy Cle…    7490 Manhat… Upper …    40.8   -74.0 Privat…    79       2
10  5238 Cute & C…    7549 Manhat… Chinat…    40.7   -74.0 Entire…   150       1
# … with 48,885 more rows, 5 more variables: number_of_reviews <dbl>,
#   last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>,
#   outlier <dbl>, and abbreviated variable names ¹neighbourhood_group,
#   ²neighbourhood, ³latitude, ⁴longitude, ⁵room_type, ⁶minimum_nights

###host_df.

host_df <- ab_df %>%
  select(c(host_id, host_name, calculated_host_listings_count))

head(host_df)

# A tibble: 6 × 3
  host_id host_name   calculated_host_listings_count
    <dbl> <chr>                                <dbl>
1    2787 John                                     6
2    2845 Jennifer                                 2
3    4632 Elisabeth                                1
4    4869 LisaRoxanne                              1
5    7192 Laura                                    1
6    7322 Chris                                    1

host_df needs to be cleaned because it has several duplicated information. For example, host_id should be unique however the same host_id appear multiple times as the below table shows.

host_df %>% group_by(host_id) %>%
  dplyr::summarise(number =n()) %>%
  arrange(desc(number))

# A tibble: 37,457 × 2
     host_id number
       <dbl>  <int>
 1 219517861    327
 2 107434423    232
 3  30283594    121
 4 137358866    103
 5  12243051     96
 6  16098958     96
 7  61391963     91
 8  22541573     87
 9 200380610     65
10   1475015     52
# … with 37,447 more rows

So I removed duplicated rows by using distinct() function

host_df <- host_df %>% distinct(host_id, .keep_all = TRUE) %>%
  arrange(desc(calculated_host_listings_count))

head(host_df)

# A tibble: 6 × 3
    host_id host_name      calculated_host_listings_count
      <dbl> <chr>                                   <dbl>
1 219517861 Sonder (NYC)                              327
2 107434423 Blueground                                232
3  30283594 Kara                                      121
4 137358866 Kazuya                                    103
5  16098958 Jeremy & Laura                             96
6  12243051 Sonder                                     96

Also, I renamed the column name calculated_host_listings_count to make it easier to understand

colnames(host_df)[3] <- "total_accommodation_count"
head(host_df)

# A tibble: 6 × 3
    host_id host_name      total_accommodation_count
      <dbl> <chr>                              <dbl>
1 219517861 Sonder (NYC)                         327
2 107434423 Blueground                           232
3  30283594 Kara                                 121
4 137358866 Kazuya                               103
5  16098958 Jeremy & Laura                        96
6  12243051 Sonder                                96

print(summarytools::dfSummary(host_df,
                         varnumbers = FALSE,
                         plain.ascii  = FALSE,
                         style        = "grid",
                         graph.magnif = 0.80,
                        valid.col    = FALSE),
       method = 'render',
       table.classes = 'table-condensed')

Data Frame Summary

host_df

Dimensions: 37457 x 3
Duplicates: 0

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

host_id [numeric]

Mean (sd) : 61445493 (75117502)

min ≤ med ≤ max:

2438 ≤ 26790498 ≤ 274321313

IQR (CV) : 81121712 (1.2)

37457 distinct values

0 (0.0%)

host_name [character]

1. Michael

2. David

3. Alex

4. Daniel

5. Sarah

6. John

7. Andrew

8. Chris

9. Maria

10. Anna

[ 11442 others ]

335	(	0.9%	)
318	(	0.8%	)
226	(	0.6%	)
208	(	0.6%	)
199	(	0.5%	)
188	(	0.5%	)
163	(	0.4%	)
162	(	0.4%	)
161	(	0.4%	)
156	(	0.4%	)
35323	(	94.3%	)

18 (0.0%)

total_accommodation_count [numeric]

Mean (sd) : 1.3 (2.8)

min ≤ med ≤ max:

1 ≤ 1 ≤ 327

IQR (CV) : 0 (2.1)

47 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

Visualization with Multiple Dimensions

What are the characteristics of each neighborhood?

First, I decided to visualize the number of accommodation listings by room types and neighborhood groups.

Most listings are located in Manhattan or Brooklyn.
There are few listings of shared rooms compared to entire home and private rooms.

accommodation_df %>% 
  filter(outlier==0) %>%
  group_by(neighbourhood_group, room_type)%>%
  dplyr::summarize(
    frequency = n()
  ) %>%
  ggplot(aes(fill=room_type, y=frequency, x=neighbourhood_group)) + geom_bar(position="stack", stat="identity") +
  labs(title = "The number of accommodation listings by room types and neibhgourhood groups")

I decided to visualize the price data by neighbourhood group with a violin plot because it allows us to see the distribution and the volume for several groups.

I learned that, 1) Manhattan is the most expensive neighbourhood regardless of room types. 2) Even though Manhattan and Brooklyn are typically more expensive than other neighbourhoods, they offer more reasonable entire home/apartments as well.

accommodation_df %>% filter(outlier==0) %>%
  ggplot(aes(neighbourhood_group, price)) +
  geom_violin() +
  facet_wrap(vars(room_type)) + 
  theme(axis.text.x = element_text(angle = 45)) +
  labs(title = "Price per night by room types and neighbourhood groups") +
  coord_flip()

accommodation_df %>% filter(outlier == 0) %>%
  ggplot(aes(reviews_per_month)) + 
  geom_boxplot() + 
  facet_grid(neighbourhood_group~room_type) +
  scale_x_continuous(limits=c(0,30), breaks=seq(0,30,5)) +
  labs(title = "The number of reviews per month by room types and neighbourhood groups")