Challenge 7 Erika Nagai

challenge_7
air_bnb
Visualizing Multiple Dimensions
Author

Erika Nagai

Published

October 31, 2022

library(tidyverse)
library(ggplot2)
library(summarytools)
library(knitr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. Recreate at least two graphs from previous exercises, but introduce at least one additional dimension that you omitted before using ggplot functionality (color, shape, line, facet, etc.). The goal is not to create unneeded chart ink (Tufte), but to concisely capture variation in additional dimensions that were collapsed in your earlier 2 or 3 dimensional graphs.
  • Explain why you chose the specific graph type
  5. If you haven’t tried in previous weeks, work this week to make your graphs “publication” ready with titles, captions, and pretty axis labels and other viewer-friendly features

R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code. And anyone not familiar with Edward Tufte should check out his fantastic books and courses on data visualization.

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • eggs ⭐
  • abc_poll ⭐⭐
  • australian_marriage ⭐⭐
  • hotel_bookings ⭐⭐⭐
  • air_bnb ⭐⭐⭐
  • us_hh ⭐⭐⭐⭐
  • faostat ⭐⭐⭐⭐⭐

Using the read_csv function, I read in AB_NYC_2019.csv as “ab_df”.

ab_df <- read_csv("_data/AB_NYC_2019.csv")

Briefly describe the data

I explored this dataset to understand how it is structured and what kind of data it includes.

This dataset consists of 16 columns (variables) and 48,895 rows. It includes the following variables.

str(ab_df)
spc_tbl_ [48,895 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ id                            : num [1:48895] 2539 2595 3647 3831 5022 ...
 $ name                          : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
 $ host_id                       : num [1:48895] 2787 2845 4632 4869 7192 ...
 $ host_name                     : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
 $ neighbourhood_group           : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
 $ neighbourhood                 : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
 $ latitude                      : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
 $ longitude                     : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
 $ room_type                     : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
 $ price                         : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
 $ minimum_nights                : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
 $ number_of_reviews             : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
 $ last_review                   : Date[1:48895], format: "2018-10-19" "2019-05-21" ...
 $ reviews_per_month             : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
 $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
 $ availability_365              : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
 - attr(*, "spec")=
  .. cols(
  ..   id = col_double(),
  ..   name = col_character(),
  ..   host_id = col_double(),
  ..   host_name = col_character(),
  ..   neighbourhood_group = col_character(),
  ..   neighbourhood = col_character(),
  ..   latitude = col_double(),
  ..   longitude = col_double(),
  ..   room_type = col_character(),
  ..   price = col_double(),
  ..   minimum_nights = col_double(),
  ..   number_of_reviews = col_double(),
  ..   last_review = col_date(format = ""),
  ..   reviews_per_month = col_double(),
  ..   calculated_host_listings_count = col_double(),
  ..   availability_365 = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
print(summarytools::dfSummary(ab_df,
                         varnumbers = FALSE,
                         plain.ascii  = FALSE,
                         style        = "grid",
                         graph.magnif = 0.80,
                        valid.col    = FALSE),
       method = 'render',
       table.classes = 'table-condensed')

Data Frame Summary

ab_df

Dimensions: 48895 x 16
Duplicates: 0

| Variable | Stats / Values / Freqs (% of Valid) | Missing |
|---|---|---|
| id [numeric] | Mean (sd): 19017143 (10983108); min ≤ med ≤ max: 2539 ≤ 19677284 ≤ 36487245; IQR (CV): 19680234 (0.6); 48895 distinct values | 0 (0.0%) |
| name [character] | Hillside Hotel (18), Home away from home (17), New york Multi-unit build (16), Brooklyn Apartment (12), Loft Suite @ The Box Hous (11), Private Room (11), Artsy Private BR in Fort (10), Private room (10), Beautiful Brooklyn Browns (8), Cozy Brooklyn Apartment (8), each 0.0%; [47884 others]: 48758 (99.8%) | 16 (0.0%) |
| host_id [numeric] | Mean (sd): 67620011 (78610967); min ≤ med ≤ max: 2438 ≤ 30793816 ≤ 274321313; IQR (CV): 99612390 (1.2); 37457 distinct values | 0 (0.0%) |
| host_name [character] | Michael: 417 (0.9%), David: 403 (0.8%), Sonder (NYC): 327 (0.7%), John: 294 (0.6%), Alex: 279 (0.6%), Blueground: 232 (0.5%), Sarah: 227 (0.5%), Daniel: 226 (0.5%), Jessica: 205 (0.4%), Maria: 204 (0.4%); [11442 others]: 46060 (94.2%) | 21 (0.0%) |
| neighbourhood_group [character] | Bronx: 1091 (2.2%), Brooklyn: 20104 (41.1%), Manhattan: 21661 (44.3%), Queens: 5666 (11.6%), Staten Island: 373 (0.8%) | 0 (0.0%) |
| neighbourhood [character] | Williamsburg: 3920 (8.0%), Bedford-Stuyvesant: 3714 (7.6%), Harlem: 2658 (5.4%), Bushwick: 2465 (5.0%), Upper West Side: 1971 (4.0%), Hell's Kitchen: 1958 (4.0%), East Village: 1853 (3.8%), Upper East Side: 1798 (3.7%), Crown Heights: 1564 (3.2%), Midtown: 1545 (3.2%); [211 others]: 25449 (52.0%) | 0 (0.0%) |
| latitude [numeric] | Mean (sd): 40.7 (0.1); min ≤ med ≤ max: 40.5 ≤ 40.7 ≤ 40.9; IQR (CV): 0.1 (0); 19048 distinct values | 0 (0.0%) |
| longitude [numeric] | Mean (sd): -74 (0); min ≤ med ≤ max: -74.2 ≤ -74 ≤ -73.7; IQR (CV): 0 (0); 14718 distinct values | 0 (0.0%) |
| room_type [character] | Entire home/apt: 25409 (52.0%), Private room: 22326 (45.7%), Shared room: 1160 (2.4%) | 0 (0.0%) |
| price [numeric] | Mean (sd): 152.7 (240.2); min ≤ med ≤ max: 0 ≤ 106 ≤ 10000; IQR (CV): 106 (1.6); 674 distinct values | 0 (0.0%) |
| minimum_nights [numeric] | Mean (sd): 7 (20.5); min ≤ med ≤ max: 1 ≤ 3 ≤ 1250; IQR (CV): 4 (2.9); 109 distinct values | 0 (0.0%) |
| number_of_reviews [numeric] | Mean (sd): 23.3 (44.6); min ≤ med ≤ max: 0 ≤ 5 ≤ 629; IQR (CV): 23 (1.9); 394 distinct values | 0 (0.0%) |
| last_review [Date] | min: 2011-03-28; med: 2019-05-19; max: 2019-07-08; range: 8y 3m 10d; 1764 distinct values | 10052 (20.6%) |
| reviews_per_month [numeric] | Mean (sd): 1.4 (1.7); min ≤ med ≤ max: 0 ≤ 0.7 ≤ 58.5; IQR (CV): 1.8 (1.2); 937 distinct values | 10052 (20.6%) |
| calculated_host_listings_count [numeric] | Mean (sd): 7.1 (33); min ≤ med ≤ max: 1 ≤ 1 ≤ 327; IQR (CV): 1 (4.6); 47 distinct values | 0 (0.0%) |
| availability_365 [numeric] | Mean (sd): 112.8 (131.6); min ≤ med ≤ max: 0 ≤ 45 ≤ 365; IQR (CV): 227 (1.2); 366 distinct values | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

AB_NYC_2019 documents accommodations in New York City (as you can tell from the name of the original csv file and the average longitude and latitude) that were listed on Airbnb as of some point in 2019. The information includes: 1) the id and name of the accommodation, 2) the id and name of the host, 3) geographic information about the accommodation (neighbourhood_group, neighbourhood, latitude, longitude), 4) reservation-related information (room type, price, minimum nights per stay), 5) review information (date of last review, total number of reviews, average number of reviews per month), and 6) the number of days per year the accommodation is available.

Accommodations without any reviews have NA (missing values) in last_review and reviews_per_month: the number of missing values in those two columns (10052) matches the number of accommodations whose number_of_reviews is 0.

ab_df %>% filter(`number_of_reviews` == 0) %>%
  count()
# A tibble: 1 × 1
      n
  <int>
1 10052
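
As a quick cross-check (a sketch using the column names shown above), the rows missing last_review should be exactly the rows with zero reviews:

```r
# All three counts should be 10052 if the missingness pattern described above holds
ab_df %>%
  summarise(
    missing_last_review = sum(is.na(last_review)),
    zero_reviews        = sum(number_of_reviews == 0),
    both                = sum(is.na(last_review) & number_of_reviews == 0)
  )
```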

This dataset is already quite tidy because each row is one observation (one accommodation). However, it can be separated into a host_df data frame and an accommodation_df data frame to make it cleaner and easier to use.

Accommodation_df

I created accommodation_df as follows, removing the host_name and calculated_host_listings_count columns. (Note: we should NOT remove host_id because it will be necessary if we want to join accommodation_df and host_df; see the sketch after the output below.)

accommodation_df <- ab_df %>%
  select(-c(host_name, calculated_host_listings_count))

head(accommodation_df)
# A tibble: 6 × 14
     id name       host_id neigh…¹ neigh…² latit…³ longi…⁴ room_…⁵ price minim…⁶
  <dbl> <chr>        <dbl> <chr>   <chr>     <dbl>   <dbl> <chr>   <dbl>   <dbl>
1  2539 Clean & q…    2787 Brookl… Kensin…    40.6   -74.0 Privat…   149       1
2  2595 Skylit Mi…    2845 Manhat… Midtown    40.8   -74.0 Entire…   225       1
3  3647 THE VILLA…    4632 Manhat… Harlem     40.8   -73.9 Privat…   150       3
4  3831 Cozy Enti…    4869 Brookl… Clinto…    40.7   -74.0 Entire…    89       1
5  5022 Entire Ap…    7192 Manhat… East H…    40.8   -73.9 Entire…    80      10
6  5099 Large Coz…    7322 Manhat… Murray…    40.7   -74.0 Entire…   200       3
# … with 4 more variables: number_of_reviews <dbl>, last_review <date>,
#   reviews_per_month <dbl>, availability_365 <dbl>, and abbreviated variable
#   names ¹​neighbourhood_group, ²​neighbourhood, ³​latitude, ⁴​longitude,
#   ⁵​room_type, ⁶​minimum_nights
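
Keeping host_id pays off later: once host_df is built (see the host_df section further down), the two tables can be re-joined whenever host-level columns are needed again. A minimal sketch, assuming host_df exists as created below:

```r
# Sketch: attach host-level columns back onto the listing-level table via host_id
accommodation_df %>%
  left_join(host_df, by = "host_id") %>%
  head()
```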

I realized that there are 17,541 observations with 0 as price and/or availability_365, which doesn’t make sense for an active Airbnb listing.

summary(accommodation_df)
       id               name              host_id          neighbourhood_group
 Min.   :    2539   Length:48895       Min.   :     2438   Length:48895       
 1st Qu.: 9471945   Class :character   1st Qu.:  7822033   Class :character   
 Median :19677284   Mode  :character   Median : 30793816   Mode  :character   
 Mean   :19017143                      Mean   : 67620011                      
 3rd Qu.:29152178                      3rd Qu.:107434423                      
 Max.   :36487245                      Max.   :274321313                      
                                                                              
 neighbourhood         latitude       longitude       room_type        
 Length:48895       Min.   :40.50   Min.   :-74.24   Length:48895      
 Class :character   1st Qu.:40.69   1st Qu.:-73.98   Class :character  
 Mode  :character   Median :40.72   Median :-73.96   Mode  :character  
                    Mean   :40.73   Mean   :-73.95                     
                    3rd Qu.:40.76   3rd Qu.:-73.94                     
                    Max.   :40.91   Max.   :-73.71                     
                                                                       
     price         minimum_nights    number_of_reviews  last_review        
 Min.   :    0.0   Min.   :   1.00   Min.   :  0.00    Min.   :2011-03-28  
 1st Qu.:   69.0   1st Qu.:   1.00   1st Qu.:  1.00    1st Qu.:2018-07-08  
 Median :  106.0   Median :   3.00   Median :  5.00    Median :2019-05-19  
 Mean   :  152.7   Mean   :   7.03   Mean   : 23.27    Mean   :2018-10-04  
 3rd Qu.:  175.0   3rd Qu.:   5.00   3rd Qu.: 24.00    3rd Qu.:2019-06-23  
 Max.   :10000.0   Max.   :1250.00   Max.   :629.00    Max.   :2019-07-08  
                                                       NA's   :10052       
 reviews_per_month availability_365
 Min.   : 0.010    Min.   :  0.0   
 1st Qu.: 0.190    1st Qu.:  0.0   
 Median : 0.720    Median : 45.0   
 Mean   : 1.373    Mean   :112.8   
 3rd Qu.: 2.020    3rd Qu.:227.0   
 Max.   :58.500    Max.   :365.0   
 NA's   :10052                     
accommodation_df %>% 
  select(c("name", "price", "availability_365")) %>%
  filter(price == 0 | availability_365 == 0)
# A tibble: 17,541 × 3
   name                                               price availability_365
   <chr>                                              <dbl>            <dbl>
 1 Entire Apt: Spacious Studio/Loft by central park      80                0
 2 BlissArtsSpace!                                       60                0
 3 Cozy Clean Guest Room - Family Apt                    79                0
 4 West Village Nest - Superhost                        120                0
 5 Sweet and Spacious Brooklyn Loft                     299                0
 6 Magnifique Suite au N de Manhattan - vue Cloitres     80                0
 7 bright and stylish duplex                            115                0
 8 Light-filled 2B duplex in the heart of Park Slope!   225                0
 9 Great Location for NYC                                50                0
10 Charming 1 bed GR8 WBurg LOCATION!                   100                0
# … with 17,531 more rows

These listings should be considered unavailable, or at least not ready to receive travelers, so I created a new column, outlier (1 = outlier, 0 = not an outlier).

accommodation_df <- accommodation_df %>%
  mutate(outlier = case_when(price == 0 | availability_365 == 0 ~ 1,
                             TRUE ~ 0))
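
A quick sanity check (a sketch) on how many listings this first rule flags:

```r
# 1 = flagged as outlier, 0 = kept
accommodation_df %>% count(outlier)
```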

Looking at minimum_nights, I noticed that some accommodations require a very long minimum stay.

I learned from the table below that 1) more than 85% of listings have a minimum stay of 7 nights or less, 2) more than 98% of listings have a minimum stay of 31 nights or less, and 3) there are small peaks at 30 nights (3760), 60 nights (106), 90 nights (104), and 120 nights (28), suggesting that some listings require a 1-, 2-, 3-, or 4-month stay.

accommodation_df %>%
  group_by(minimum_nights) %>%
  dplyr::summarize(
    frequency = n(),
  ) %>%
  mutate(
    cumulative_percentage = cumsum(frequency)/sum(frequency)*100
  ) %>%
  kable()
minimum_nights frequency cumulative_percentage
1 12720 26.01493
2 11696 49.93558
3 7999 66.29512
4 3303 73.05041
5 3034 79.25555
6 752 80.79354
7 2058 85.00256
8 130 85.26843
9 80 85.43205
10 483 86.41988
11 33 86.48737
12 91 86.67348
13 54 86.78392
14 562 87.93333
15 279 88.50394
16 18 88.54075
17 14 88.56938
18 28 88.62665
19 8 88.64301
20 223 89.09909
21 135 89.37519
22 8 89.39155
23 8 89.40791
24 14 89.43655
25 82 89.60425
26 16 89.63698
27 17 89.67175
28 203 90.08692
29 340 90.78229
30 3760 98.47224
31 201 98.88332
32 8 98.89968
33 2 98.90377
35 15 98.93445
36 1 98.93650
37 1 98.93854
39 1 98.94059
40 13 98.96717
42 1 98.96922
43 1 98.97127
44 2 98.97536
45 32 99.04080
47 2 99.04489
50 14 99.07352
53 2 99.07762
55 4 99.08580
56 2 99.08989
58 1 99.09193
59 6 99.10420
60 106 99.32099
62 3 99.32713
65 1 99.32917
68 1 99.33122
70 8 99.34758
74 1 99.34963
75 4 99.35781
80 9 99.37621
85 1 99.37826
87 1 99.38030
88 2 99.38440
90 104 99.59710
91 3 99.60323
93 2 99.60732
98 1 99.60937
99 2 99.61346
100 15 99.64414
105 2 99.64823
110 3 99.65436
114 1 99.65641
115 2 99.66050
120 28 99.71776
122 1 99.71981
133 2 99.72390
134 1 99.72594
150 6 99.73821
153 1 99.74026
160 2 99.74435
175 1 99.74640
180 43 99.83434
181 2 99.83843
182 1 99.84047
183 1 99.84252
184 1 99.84456
185 1 99.84661
186 1 99.84866
198 1 99.85070
200 4 99.85888
210 2 99.86297
222 1 99.86502
225 1 99.86706
240 3 99.87320
250 1 99.87524
265 1 99.87729
270 2 99.88138
275 1 99.88342
299 1 99.88547
300 6 99.89774
354 1 99.89979
360 5 99.91001
364 1 99.91206
365 29 99.97137
366 1 99.97341
370 1 99.97546
400 1 99.97750
480 1 99.97955
500 5 99.98977
999 3 99.99591
1000 1 99.99795
1250 1 100.00000

Thus, it is fairly safe to treat listings that require more than 31 minimum nights (about one month) as outliers. Since these accommodations are not available to short-term travelers, I mark them as outliers (outlier = 1).
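
As a quick check (a sketch that restates the cumulative percentages above), the share of listings above this cutoff can be computed directly:

```r
# Proportion of listings requiring more than 31 minimum nights, in percent
accommodation_df %>%
  summarise(pct_over_31_nights = mean(minimum_nights > 31) * 100)
```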

accommodation_df <- accommodation_df %>%
  mutate(outlier = case_when(price == 0 | availability_365 == 0 | minimum_nights > 31 ~ 1,
                             TRUE ~ 0))

accommodation_df
# A tibble: 48,895 × 15
      id name      host_id neigh…¹ neigh…² latit…³ longi…⁴ room_…⁵ price minim…⁶
   <dbl> <chr>       <dbl> <chr>   <chr>     <dbl>   <dbl> <chr>   <dbl>   <dbl>
 1  2539 Clean & …    2787 Brookl… Kensin…    40.6   -74.0 Privat…   149       1
 2  2595 Skylit M…    2845 Manhat… Midtown    40.8   -74.0 Entire…   225       1
 3  3647 THE VILL…    4632 Manhat… Harlem     40.8   -73.9 Privat…   150       3
 4  3831 Cozy Ent…    4869 Brookl… Clinto…    40.7   -74.0 Entire…    89       1
 5  5022 Entire A…    7192 Manhat… East H…    40.8   -73.9 Entire…    80      10
 6  5099 Large Co…    7322 Manhat… Murray…    40.7   -74.0 Entire…   200       3
 7  5121 BlissArt…    7356 Brookl… Bedfor…    40.7   -74.0 Privat…    60      45
 8  5178 Large Fu…    8967 Manhat… Hell's…    40.8   -74.0 Privat…    79       2
 9  5203 Cozy Cle…    7490 Manhat… Upper …    40.8   -74.0 Privat…    79       2
10  5238 Cute & C…    7549 Manhat… Chinat…    40.7   -74.0 Entire…   150       1
# … with 48,885 more rows, 5 more variables: number_of_reviews <dbl>,
#   last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>,
#   outlier <dbl>, and abbreviated variable names ¹​neighbourhood_group,
#   ²​neighbourhood, ³​latitude, ⁴​longitude, ⁵​room_type, ⁶​minimum_nights

Some accommodations cost more than 1,000 USD per night, which is extremely expensive. Looking at the table below, fewer than 0.5% of accommodations charge more than 1,000 USD per night, so I decided to treat those accommodations as outliers as well.

accommodation_df %>%
  group_by(price) %>%
  dplyr::summarize(
    frequency = n(),
  ) %>%
  mutate(
    cumulative_percentage = cumsum(frequency)/sum(frequency)*100
  ) %>%
  kable()
price frequency cumulative_percentage
0 11 0.0224972
10 17 0.0572656
11 3 0.0634012
12 4 0.0715820
13 1 0.0736272
15 6 0.0858984
16 6 0.0981695
18 2 0.1022599
19 4 0.1104407
20 33 0.1779323
21 6 0.1902035
22 16 0.2229267
23 7 0.2372431
24 13 0.2638307
25 100 0.4683505
26 24 0.5174353
27 29 0.5767461
28 55 0.6892320
29 67 0.8262604
30 255 1.3477861
31 46 1.4418652
32 82 1.6095715
33 93 1.7997750
34 84 1.9715717
35 480 2.9532672
36 103 3.1639227
37 132 3.4338889
38 167 3.7754372
39 196 4.1762961
40 771 5.7531445
41 86 5.9290316
42 192 6.3217098
43 113 6.5528173
44 119 6.7961959
45 891 8.6184681
46 102 8.8270784
47 163 9.1604459
48 203 9.5756212
49 345 10.2812148
50 1534 13.4185500
51 96 13.6148890
52 131 13.8828101
53 118 14.1241436
54 112 14.3532058
55 950 16.2961448
56 74 16.4474895
57 109 16.6704162
58 157 16.9915124
59 297 17.5989365
60 1458 20.5808365
61 54 20.6912772
62 140 20.9776051
63 108 21.1984866
64 90 21.3825545
65 1190 23.8163411
66 87 23.9942734
67 107 24.2131097
68 160 24.5403415
69 372 25.3011555
70 1170 27.6940382
71 60 27.8167502
72 124 28.0703548
73 84 28.2421515
74 68 28.3812251
75 1370 31.1831476
76 68 31.3222211
77 90 31.5062890
78 122 31.7558033
79 299 32.3673177
80 1272 34.9688107
81 51 35.0731159
82 71 35.2183250
83 63 35.3471725
84 75 35.5005624
85 877 37.2942019
86 79 37.4557726
87 83 37.6255241
88 115 37.8607220
89 365 38.6072196
90 1021 40.6953676
91 52 40.8017180
92 61 40.9264751
93 64 41.0573678
94 68 41.1964414
95 700 42.6280806
96 64 42.7589733
97 91 42.9450864
98 137 43.2252787
99 742 44.7428162
100 2051 48.9375192
101 46 49.0315983
102 45 49.1236323
103 49 49.2238470
104 44 49.3138358
105 325 49.9785254
106 35 50.0501074
107 42 50.1360057
108 54 50.2464465
109 152 50.5573167
110 739 52.0687187
111 38 52.1464362
112 53 52.2548318
113 42 52.3407301
114 28 52.3979957
115 468 53.3551488
116 34 53.4246856
117 53 53.5330811
118 57 53.6496574
119 191 54.0402904
120 1130 56.3513652
121 23 56.3984047
122 41 56.4822579
123 44 56.5722467
124 47 56.6683710
125 1057 58.8301462
126 33 58.8976378
127 27 58.9528582
128 60 59.0755701
129 207 59.4989263
130 610 60.7464976
131 24 60.7955824
132 36 60.8692095
133 53 60.9776051
134 36 61.0512322
135 415 61.8999898
136 24 61.9490745
137 35 62.0206565
138 38 62.0983741
139 170 62.4460579
140 548 63.5668269
141 11 63.5893241
142 55 63.7018100
143 24 63.7508948
144 31 63.8142959
145 323 64.4748952
146 25 64.5260252
147 32 64.5914715
148 50 64.6937315
149 278 65.2622968
150 2047 69.4488189
151 35 69.5204009
152 24 69.5694856
153 21 69.6124348
154 34 69.6819716
155 183 70.0562430
156 21 70.0991921
157 20 70.1400961
158 37 70.2157685
159 164 70.5511811
160 480 71.5328766
161 15 71.5635546
162 22 71.6085489
163 22 71.6535433
164 23 71.7005829
165 315 72.3448205
166 20 72.3857245
167 23 72.4327641
168 34 72.5023008
169 123 72.7538603
170 295 73.3571940
171 12 73.3817364
172 28 73.4390019
173 16 73.4717251
174 31 73.5351263
175 763 75.0956130
176 19 75.1344718
177 30 75.1958278
178 23 75.2428674
179 143 75.5353308
180 522 76.6029246
181 14 76.6315574
182 21 76.6745066
183 9 76.6929134
184 24 76.7419982
185 236 77.2246651
186 18 77.2614787
187 24 77.3105635
188 30 77.3719194
189 123 77.6234789
190 292 78.2206770
191 13 78.2472645
192 10 78.2677165
193 5 78.2779425
194 15 78.3086205
195 293 78.9078638
196 27 78.9630842
197 28 79.0203497
198 58 79.1389713
199 415 79.9877288
200 1401 82.8530525
201 8 82.8694141
202 15 82.9000920
203 9 82.9184988
204 7 82.9328152
205 50 83.0350752
206 15 83.0657531
207 12 83.0902955
208 13 83.1168831
209 46 83.2109623
210 133 83.4829737
211 12 83.5075161
212 14 83.5361489
213 3 83.5422845
214 14 83.5709173
215 111 83.7979343
216 4 83.8061151
217 12 83.8306575
218 12 83.8551999
219 56 83.9697311
220 254 84.4892116
221 8 84.5055732
222 14 84.5342060
223 6 84.5464771
224 13 84.5730647
225 451 85.4954494
226 12 85.5199918
227 14 85.5486246
228 23 85.5956642
229 64 85.7265569
230 109 85.9494836
231 6 85.9617548
232 10 85.9822068
233 11 86.0047040
234 12 86.0292463
235 66 86.1642295
236 10 86.1846815
237 6 86.1969527
238 6 86.2092238
239 106 86.4260149
240 114 86.6591676
241 9 86.6775744
242 4 86.6857552
243 10 86.7062072
244 15 86.7368852
245 93 86.9270887
246 7 86.9414051
247 11 86.9639022
248 15 86.9945802
249 133 87.2665917
250 1018 89.3486042
251 5 89.3588301
252 7 89.3731465
253 3 89.3792821
254 7 89.3935985
255 42 89.4794969
256 8 89.4958585
257 5 89.5060845
258 5 89.5163105
259 30 89.5776664
260 81 89.7433275
261 5 89.7535535
262 8 89.7699151
263 5 89.7801411
264 12 89.8046835
265 58 89.9233050
266 1 89.9253502
267 9 89.9437570
268 6 89.9560282
269 33 90.0235198
270 42 90.1094181
271 10 90.1298701
272 4 90.1380509
273 5 90.1482769
274 9 90.1666837
275 244 90.6657122
276 8 90.6820738
277 3 90.6882094
278 15 90.7188874
279 23 90.7659270
280 91 90.9520401
281 2 90.9561305
282 5 90.9663565
283 5 90.9765825
284 2 90.9806729
285 65 91.1136108
287 7 91.1279272
288 12 91.1524696
289 38 91.2301871
290 63 91.3590347
291 3 91.3651703
292 1 91.3672155
293 3 91.3733511
294 6 91.3856223
295 114 91.6187749
296 1 91.6208201
297 5 91.6310461
298 13 91.6576337
299 161 91.9869107
300 561 93.1342673
301 1 93.1363125
302 7 93.1506289
303 7 93.1649453
304 3 93.1710809
305 23 93.2181205
306 6 93.2303917
307 5 93.2406177
308 7 93.2549340
309 7 93.2692504
310 21 93.3121996
311 2 93.3162900
312 9 93.3346968
313 1 93.3367420
314 9 93.3551488
315 19 93.3940076
316 13 93.4205952
317 4 93.4287759
318 1 93.4308211
319 8 93.4471827
320 39 93.5269455
321 6 93.5392167
322 1 93.5412619
323 2 93.5453523
324 2 93.5494427
325 120 93.7948666
327 1 93.7969117
328 3 93.8030473
329 13 93.8296349
330 27 93.8848553
331 1 93.8869005
332 3 93.8930361
333 9 93.9114429
334 5 93.9216689
335 8 93.9380305
336 1 93.9400757
337 2 93.9441661
338 1 93.9462113
339 16 93.9789345
340 18 94.0157480
341 4 94.0239288
342 1 94.0259740
343 4 94.0341548
344 1 94.0362000
345 25 94.0873300
346 2 94.0914204
347 4 94.0996012
348 3 94.1057368
349 45 94.1977707
350 381 94.9769915
352 4 94.9851723
353 1 94.9872175
355 10 95.0076695
356 2 95.0117599
358 7 95.0260763
359 13 95.0526639
360 21 95.0956130
361 1 95.0976582
362 1 95.0997034
365 18 95.1365170
367 2 95.1406074
369 19 95.1794662
370 13 95.2060538
372 2 95.2101442
373 1 95.2121894
374 6 95.2244606
375 72 95.3717149
376 3 95.3778505
377 5 95.3880765
378 3 95.3942121
379 16 95.4269353
380 36 95.5005624
381 1 95.5026076
382 3 95.5087432
383 1 95.5107884
385 14 95.5394212
386 1 95.5414664
387 2 95.5455568
388 3 95.5516924
389 10 95.5721444
390 24 95.6212292
392 4 95.6294100
393 2 95.6335004
394 2 95.6375908
395 41 95.7214439
396 6 95.7337151
397 1 95.7357603
398 8 95.7521219
399 89 95.9341446
400 225 96.3943143
402 1 96.3963595
403 2 96.4004499
404 2 96.4045403
407 2 96.4086307
408 3 96.4147663
409 5 96.4249923
410 6 96.4372635
412 5 96.4474895
414 2 96.4515799
415 5 96.4618059
416 4 96.4699867
417 1 96.4720319
418 2 96.4761223
419 5 96.4863483
420 13 96.5129359
421 2 96.5170263
422 1 96.5190715
423 1 96.5211167
424 1 96.5231619
425 50 96.6254218
426 1 96.6274670
428 1 96.6295122
429 5 96.6397382
430 6 96.6520094
431 1 96.6540546
432 2 96.6581450
433 2 96.6622354
434 1 96.6642806
435 3 96.6704162
436 1 96.6724614
438 2 96.6765518
439 2 96.6806422
440 4 96.6888230
442 1 96.6908682
443 3 96.6970038
444 2 96.7010942
445 2 96.7051846
446 1 96.7072298
447 2 96.7113202
449 12 96.7358626
450 156 97.0549136
451 2 97.0590040
452 1 97.0610492
453 1 97.0630944
454 1 97.0651396
455 4 97.0733204
459 5 97.0835464
460 7 97.0978628
462 1 97.0999080
463 1 97.1019532
465 7 97.1162696
466 2 97.1203600
468 2 97.1244504
469 1 97.1264956
470 4 97.1346763
471 1 97.1367215
472 3 97.1428571
474 1 97.1449023
475 23 97.1919419
477 1 97.1939871
478 2 97.1980775
479 5 97.2083035
480 8 97.2246651
481 1 97.2267103
483 1 97.2287555
484 2 97.2328459
485 7 97.2471623
487 2 97.2512527
488 2 97.2553431
489 7 97.2696595
490 7 97.2839759
495 25 97.3351058
496 1 97.3371510
497 1 97.3391962
498 1 97.3412414
499 65 97.4741794
500 191 97.8648124
503 2 97.8689028
505 3 97.8750383
506 1 97.8770835
509 1 97.8791287
510 2 97.8832191
515 3 97.8893547
519 1 97.8913999
520 5 97.9016259
524 1 97.9036711
525 15 97.9343491
530 2 97.9384395
531 1 97.9404847
534 7 97.9548011
539 1 97.9568463
540 3 97.9629819
545 5 97.9732079
549 5 97.9834339
550 51 98.0877390
555 1 98.0897842
559 1 98.0918294
560 5 98.1020554
561 1 98.1041006
573 1 98.1061458
575 12 98.1306882
577 2 98.1347786
578 1 98.1368238
580 1 98.1388690
583 2 98.1429594
584 1 98.1450046
585 5 98.1552306
587 1 98.1572758
589 2 98.1613662
590 8 98.1777278
595 13 98.2043154
599 19 98.2431741
600 81 98.4088353
612 1 98.4108805
615 2 98.4149709
616 1 98.4170161
619 1 98.4190613
620 2 98.4231517
625 4 98.4313324
639 1 98.4333776
640 2 98.4374680
643 1 98.4395132
647 1 98.4415584
649 7 98.4558748
650 55 98.5683608
655 1 98.5704060
659 1 98.5724512
670 2 98.5765416
672 2 98.5806320
675 13 98.6072196
680 1 98.6092648
688 1 98.6113099
689 1 98.6133551
690 4 98.6215359
692 1 98.6235811
695 2 98.6276715
699 35 98.6992535
700 47 98.7953779
714 12 98.8199202
718 1 98.8219654
720 1 98.8240106
725 1 98.8260558
737 10 98.8465078
745 2 98.8505982
748 7 98.8649146
749 2 98.8690050
750 45 98.9610390
760 2 98.9651294
766 1 98.9671746
780 1 98.9692198
785 2 98.9733102
789 1 98.9753554
790 2 98.9794458
794 1 98.9814909
795 6 98.9937621
799 18 99.0305757
800 54 99.1410165
805 1 99.1430617
814 1 99.1451069
820 1 99.1471521
822 1 99.1491973
825 2 99.1532877
830 1 99.1553329
840 2 99.1594233
843 1 99.1614685
848 1 99.1635137
850 19 99.2023724
860 2 99.2064628
875 1 99.2085080
888 1 99.2105532
890 3 99.2166888
894 1 99.2187340
895 3 99.2248696
898 1 99.2269148
899 5 99.2371408
900 20 99.2780448
913 1 99.2800900
920 1 99.2821352
930 1 99.2841804
935 1 99.2862256
945 3 99.2923612
949 2 99.2964516
950 11 99.3189488
956 3 99.3250844
974 1 99.3271296
975 4 99.3353104
980 2 99.3394008
985 1 99.3414460
990 1 99.3434912
995 6 99.3557623
999 17 99.3905307
1000 59 99.5111975
1002 1 99.5132427
1020 1 99.5152879
1046 1 99.5173331
1050 3 99.5234687
1066 1 99.5255139
1067 1 99.5275591
1075 1 99.5296043
1080 1 99.5316495
1095 1 99.5336947
1099 1 99.5357399
1100 12 99.5602822
1115 1 99.5623274
1145 1 99.5643726
1150 2 99.5684630
1170 1 99.5705082
1177 1 99.5725534
1195 2 99.5766438
1200 15 99.6073218
1250 6 99.6195930
1295 1 99.6216382
1299 1 99.6236834
1300 3 99.6298190
1306 1 99.6318642
1308 1 99.6339094
1315 1 99.6359546
1333 1 99.6379998
1350 2 99.6420902
1369 1 99.6441354
1395 1 99.6461806
1400 2 99.6502710
1450 1 99.6523162
1475 1 99.6543614
1485 1 99.6564066
1494 1 99.6584518
1495 1 99.6604970
1497 1 99.6625422
1500 26 99.7157174
1550 1 99.7177626
1599 2 99.7218530
1600 4 99.7300337
1680 1 99.7320789
1700 4 99.7402597
1731 1 99.7423049
1749 1 99.7443501
1750 3 99.7504857
1763 1 99.7525309
1795 1 99.7545761
1799 1 99.7566213
1800 2 99.7607117
1880 1 99.7627569
1899 1 99.7648021
1900 1 99.7668473
1978 1 99.7688925
1999 3 99.7750281
2000 24 99.8241129
2010 1 99.8261581
2100 1 99.8282033
2103 1 99.8302485
2200 1 99.8322937
2250 2 99.8363841
2300 1 99.8384293
2350 1 99.8404745
2400 1 99.8425197
2500 11 99.8650169
2545 1 99.8670621
2590 1 99.8691073
2600 1 99.8711525
2626 1 99.8731977
2695 1 99.8752429
2750 1 99.8772881
2800 1 99.8793333
2850 1 99.8813785
2900 1 99.8834237
2990 1 99.8854689
2995 1 99.8875141
2999 2 99.8916045
3000 8 99.9079660
3200 1 99.9100112
3210 1 99.9120564
3512 1 99.9141016
3518 1 99.9161468
3600 1 99.9181920
3613 1 99.9202372
3750 2 99.9243276
3800 1 99.9263728
3900 1 99.9284180
4000 3 99.9345536
4100 1 99.9365988
4160 1 99.9386440
4200 2 99.9427344
4500 2 99.9468248
5000 6 99.9590960
5100 1 99.9611412
5250 1 99.9631864
6000 2 99.9672768
6419 1 99.9693220
6500 3 99.9754576
6800 1 99.9775028
7500 2 99.9815932
7703 1 99.9836384
8000 1 99.9856836
8500 1 99.9877288
9999 3 99.9938644
10000 3 100.0000000
accommodation_df <- accommodation_df %>%
  mutate(outlier = case_when(price == 0 | price > 1000 | availability_365 == 0 | minimum_nights > 31 ~ 1,
                             TRUE ~ 0))

accommodation_df
# A tibble: 48,895 × 15
      id name      host_id neigh…¹ neigh…² latit…³ longi…⁴ room_…⁵ price minim…⁶
   <dbl> <chr>       <dbl> <chr>   <chr>     <dbl>   <dbl> <chr>   <dbl>   <dbl>
 1  2539 Clean & …    2787 Brookl… Kensin…    40.6   -74.0 Privat…   149       1
 2  2595 Skylit M…    2845 Manhat… Midtown    40.8   -74.0 Entire…   225       1
 3  3647 THE VILL…    4632 Manhat… Harlem     40.8   -73.9 Privat…   150       3
 4  3831 Cozy Ent…    4869 Brookl… Clinto…    40.7   -74.0 Entire…    89       1
 5  5022 Entire A…    7192 Manhat… East H…    40.8   -73.9 Entire…    80      10
 6  5099 Large Co…    7322 Manhat… Murray…    40.7   -74.0 Entire…   200       3
 7  5121 BlissArt…    7356 Brookl… Bedfor…    40.7   -74.0 Privat…    60      45
 8  5178 Large Fu…    8967 Manhat… Hell's…    40.8   -74.0 Privat…    79       2
 9  5203 Cozy Cle…    7490 Manhat… Upper …    40.8   -74.0 Privat…    79       2
10  5238 Cute & C…    7549 Manhat… Chinat…    40.7   -74.0 Entire…   150       1
# … with 48,885 more rows, 5 more variables: number_of_reviews <dbl>,
#   last_review <date>, reviews_per_month <dbl>, availability_365 <dbl>,
#   outlier <dbl>, and abbreviated variable names ¹​neighbourhood_group,
#   ²​neighbourhood, ³​latitude, ⁴​longitude, ⁵​room_type, ⁶​minimum_nights
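
To keep the flagging transparent, here is a sketch that counts how many listings each rule catches (a listing can match more than one rule), along with the total number flagged:

```r
# Per-rule counts plus the overall number of flagged listings
accommodation_df %>%
  summarise(
    zero_price        = sum(price == 0),
    over_1000_usd     = sum(price > 1000),
    zero_availability = sum(availability_365 == 0),
    over_31_nights    = sum(minimum_nights > 31),
    flagged_total     = sum(outlier == 1)
  )
```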

host_df

host_df <- ab_df %>%
  select(c(host_id, host_name, calculated_host_listings_count))

head(host_df)
# A tibble: 6 × 3
  host_id host_name   calculated_host_listings_count
    <dbl> <chr>                                <dbl>
1    2787 John                                     6
2    2845 Jennifer                                 2
3    4632 Elisabeth                                1
4    4869 LisaRoxanne                              1
5    7192 Laura                                    1
6    7322 Chris                                    1

host_df needs to be cleaned because it contains duplicated information. For example, host_id should be unique; however, the same host_id appears multiple times, as the table below shows.

host_df %>% group_by(host_id) %>%
  dplyr::summarise(number =n()) %>%
  arrange(desc(number))
# A tibble: 37,457 × 2
     host_id number
       <dbl>  <int>
 1 219517861    327
 2 107434423    232
 3  30283594    121
 4 137358866    103
 5  12243051     96
 6  16098958     96
 7  61391963     91
 8  22541573     87
 9 200380610     65
10   1475015     52
# … with 37,447 more rows

So I removed the duplicated rows using the distinct() function.

host_df <- host_df %>% distinct(host_id, .keep_all = TRUE) %>%
  arrange(desc(calculated_host_listings_count))

head(host_df)
# A tibble: 6 × 3
    host_id host_name      calculated_host_listings_count
      <dbl> <chr>                                   <dbl>
1 219517861 Sonder (NYC)                              327
2 107434423 Blueground                                232
3  30283594 Kara                                      121
4 137358866 Kazuya                                    103
5  16098958 Jeremy & Laura                             96
6  12243051 Sonder                                     96

Also, I renamed the calculated_host_listings_count column to total_accommodation_count to make it easier to understand.

colnames(host_df)[3] <- "total_accommodation_count"
head(host_df)
# A tibble: 6 × 3
    host_id host_name      total_accommodation_count
      <dbl> <chr>                              <dbl>
1 219517861 Sonder (NYC)                         327
2 107434423 Blueground                           232
3  30283594 Kara                                 121
4 137358866 Kazuya                               103
5  16098958 Jeremy & Laura                        96
6  12243051 Sonder                                96
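
Since calculated_host_listings_count looks like a per-host count of listings in this file, an optional sanity check (a sketch, under that assumption) is to recompute the count from ab_df and compare it with the renamed column; the filter below should return zero rows if the two agree:

```r
# Recompute each host's listing count and keep only rows where it disagrees
ab_df %>%
  count(host_id, name = "recomputed_count") %>%
  left_join(host_df, by = "host_id") %>%
  filter(recomputed_count != total_accommodation_count)
```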
print(summarytools::dfSummary(host_df,
                         varnumbers = FALSE,
                         plain.ascii  = FALSE,
                         style        = "grid",
                         graph.magnif = 0.80,
                        valid.col    = FALSE),
       method = 'render',
       table.classes = 'table-condensed')

Data Frame Summary

host_df

Dimensions: 37457 x 3
Duplicates: 0

| Variable | Stats / Values / Freqs (% of Valid) | Missing |
|---|---|---|
| host_id [numeric] | Mean (sd): 61445493 (75117502); min ≤ med ≤ max: 2438 ≤ 26790498 ≤ 274321313; IQR (CV): 81121712 (1.2); 37457 distinct values | 0 (0.0%) |
| host_name [character] | Michael: 335 (0.9%), David: 318 (0.8%), Alex: 226 (0.6%), Daniel: 208 (0.6%), Sarah: 199 (0.5%), John: 188 (0.5%), Andrew: 163 (0.4%), Chris: 162 (0.4%), Maria: 161 (0.4%), Anna: 156 (0.4%); [11442 others]: 35323 (94.3%) | 18 (0.0%) |
| total_accommodation_count [numeric] | Mean (sd): 1.3 (2.8); min ≤ med ≤ max: 1 ≤ 1 ≤ 327; IQR (CV): 0 (2.1); 47 distinct values | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

Visualization with Multiple Dimensions

What are the characteristics of each neighborhood?

First, I decided to visualize the number of accommodation listings by room types and neighborhood groups.

  1. Most listings are located in Manhattan or Brooklyn.
  2. There are few listings of shared rooms compared to entire home and private rooms.
accommodation_df %>% 
  filter(outlier == 0) %>%
  group_by(neighbourhood_group, room_type) %>%
  dplyr::summarize(
    frequency = n()
  ) %>%
  ggplot(aes(fill = room_type, y = frequency, x = neighbourhood_group)) +
  geom_col(position = "stack") +
  labs(title = "The number of accommodation listings by room type and neighbourhood group")
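
Toward the “publication ready” goal in the overview, one optional refinement (a sketch; the same counts via count(), plus label and caption text of my own choosing) adds fuller axis and legend labels:

```r
# Sketch: same stacked bar chart with more descriptive labels and a caption
accommodation_df %>%
  filter(outlier == 0) %>%
  count(neighbourhood_group, room_type, name = "frequency") %>%
  ggplot(aes(x = neighbourhood_group, y = frequency, fill = room_type)) +
  geom_col(position = "stack") +
  labs(
    title   = "Airbnb listings in NYC by neighbourhood group and room type",
    x       = "Neighbourhood group",
    y       = "Number of listings",
    fill    = "Room type",
    caption = "Source: AB_NYC_2019 (flagged outlier listings excluded)"
  )
```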

I decided to visualize the price data by neighbourhood group with a violin plot because it shows both the distribution and the relative volume of several groups at once.

I learned that 1) Manhattan is the most expensive neighbourhood group regardless of room type, and 2) even though Manhattan and Brooklyn are typically more expensive than the other neighbourhood groups, they also offer plenty of reasonably priced entire homes/apartments.

accommodation_df %>% filter(outlier==0) %>%
  ggplot(aes(neighbourhood_group, price)) +
  geom_violin() +
  facet_wrap(vars(room_type)) + 
  theme(axis.text.x = element_text(angle = 45)) +
  labs(title = "Price per night by room types and neighbourhood groups") +
  coord_flip()
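
Because nightly prices are strongly right-skewed even after the flagged outliers are removed, one optional variation (a sketch, not part of the original plot) puts price on a log10 scale so the bulk of each distribution is easier to compare across groups:

```r
# Sketch: same violin plot with price on a log10 scale
accommodation_df %>%
  filter(outlier == 0) %>%
  ggplot(aes(neighbourhood_group, price)) +
  geom_violin() +
  scale_y_log10() +
  facet_wrap(vars(room_type)) +
  labs(title = "Price per night (log scale) by room type and neighbourhood group") +
  coord_flip()
```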

accommodation_df %>% filter(outlier == 0) %>%
  ggplot(aes(reviews_per_month)) + 
  geom_boxplot() + 
  facet_grid(neighbourhood_group~room_type) +
  scale_x_continuous(limits=c(0,30), breaks=seq(0,30,5)) +
  labs(title = "The number of reviews per month by room types and neighbourhood groups")
