DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 6

challenge_6
hotel_bookings
Aleacia Messiah
tidyverse
ggplot2
summarytools
lubridate
treemap
Visualizing Time and Relationships
Author

Aleacia Messiah

Published

December 4, 2022

library(tidyverse)
library(ggplot2)
library(summarytools)
library(lubridate)
library(treemap)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. mutate variables as needed (including sanity checks)
  4. create at least one graph including time (evolution)
    • try to make them “publication” ready (optional)
    • explain why you chose the specific graph type
  5. create at least one graph depicting part-whole or flow relationships
    • try to make them “publication” ready (optional)
    • explain why you chose the specific graph type

R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.

(be sure to only include the category tags for the data you use!)

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • debt ⭐
  • fed_rate ⭐⭐
  • abc_poll ⭐⭐⭐
  • usa_hh ⭐⭐⭐
  • hotel_bookings ⭐⭐⭐⭐
  • AB_NYC ⭐⭐⭐⭐⭐
# read in hotel data
hotel_orig <- read_csv("_data/hotel_bookings.csv")
# view hotel data
hotel_orig
# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
# view summary of hotel data
dfSummary(hotel_orig)
Data Frame Summary  
hotel_orig  
Dimensions: 119390 x 32  
Duplicates: 31994  

-----------------------------------------------------------------------------------------------------------------------------------
No   Variable                         Stats / Values             Freqs (% of Valid)     Graph                  Valid      Missing  
---- -------------------------------- -------------------------- ---------------------- ---------------------- ---------- ---------
1    hotel                            1. City Hotel              79330 (66.4%)          IIIIIIIIIIIII          119390     0        
     [character]                      2. Resort Hotel            40060 (33.6%)          IIIIII                 (100.0%)   (0.0%)   

2    is_canceled                      Min  : 0                   0 : 75166 (63.0%)      IIIIIIIIIIII           119390     0        
     [numeric]                        Mean : 0.4                 1 : 44224 (37.0%)      IIIIIII                (100.0%)   (0.0%)   
                                      Max  : 1                                                                                     

3    lead_time                        Mean (sd) : 104 (106.9)    479 distinct values    :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 69 < 737                                      :                                          
                                      IQR (CV) : 142 (1)                                : : .                                      
                                                                                        : : : . .                                  

4    arrival_date_year                Mean (sd) : 2016.2 (0.7)   2015 : 21996 (18.4%)   III                    119390     0        
     [numeric]                        min < med < max:           2016 : 56707 (47.5%)   IIIIIIIII              (100.0%)   (0.0%)   
                                      2015 < 2016 < 2017         2017 : 40687 (34.1%)   IIIIII                                     
                                      IQR (CV) : 1 (0)                                                                             

5    arrival_date_month               1. August                  13877 (11.6%)          II                     119390     0        
     [character]                      2. July                    12661 (10.6%)          II                     (100.0%)   (0.0%)   
                                      3. May                     11791 ( 9.9%)          I                                          
                                      4. October                 11160 ( 9.3%)          I                                          
                                      5. April                   11089 ( 9.3%)          I                                          
                                      6. June                    10939 ( 9.2%)          I                                          
                                      7. September               10508 ( 8.8%)          I                                          
                                      8. March                    9794 ( 8.2%)          I                                          
                                      9. February                 8068 ( 6.8%)          I                                          
                                      10. November                6794 ( 5.7%)          I                                          
                                      [ 2 others ]               12709 (10.6%)          II                                         

6    arrival_date_week_number         Mean (sd) : 27.2 (13.6)    53 distinct values           . : . . .        119390     0        
     [numeric]                        min < med < max:                                    . : : : : : :        (100.0%)   (0.0%)   
                                      1 < 28 < 53                                       . : : : : : : : : :                        
                                      IQR (CV) : 22 (0.5)                               : : : : : : : : : :                        
                                                                                        : : : : : : : : : :                        

7    arrival_date_day_of_month        Mean (sd) : 15.8 (8.8)     31 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  : : : . : : . : :      (100.0%)   (0.0%)   
                                      1 < 16 < 31                                       : : : : : : : : : :                        
                                      IQR (CV) : 15 (0.6)                               : : : : : : : : : :                        
                                                                                        : : : : : : : : : :                        

8    stays_in_weekend_nights          Mean (sd) : 0.9 (1)        17 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 1 < 19                                        :                                          
                                      IQR (CV) : 2 (1.1)                                : :                                        
                                                                                        : :                                        

9    stays_in_week_nights             Mean (sd) : 2.5 (1.9)      35 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 2 < 50                                        :                                          
                                      IQR (CV) : 2 (0.8)                                :                                          
                                                                                        :                                          

10   adults                           Mean (sd) : 1.9 (0.6)      14 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 2 < 55                                        :                                          
                                      IQR (CV) : 0 (0.3)                                :                                          
                                                                                        :                                          

11   children                         Mean (sd) : 0.1 (0.4)      0 : 110796 (92.8%)     IIIIIIIIIIIIIIIIII     119386     4        
     [numeric]                        min < med < max:           1 :   4861 ( 4.1%)                            (100.0%)   (0.0%)   
                                      0 < 0 < 10                 2 :   3652 ( 3.1%)                                                
                                      IQR (CV) : 0 (3.8)         3 :     76 ( 0.1%)                                                
                                                                 10 :      1 ( 0.0%)                                               

12   babies                           Mean (sd) : 0 (0.1)        0 : 118473 (99.2%)     IIIIIIIIIIIIIIIIIII    119390     0        
     [numeric]                        min < med < max:           1 :    900 ( 0.8%)                            (100.0%)   (0.0%)   
                                      0 < 0 < 10                 2 :     15 ( 0.0%)                                                
                                      IQR (CV) : 0 (12.3)        9 :      1 ( 0.0%)                                                
                                                                 10 :      1 ( 0.0%)                                               

13   meal                             1. BB                      92310 (77.3%)          IIIIIIIIIIIIIII        119390     0        
     [character]                      2. FB                        798 ( 0.7%)                                 (100.0%)   (0.0%)   
                                      3. HB                      14463 (12.1%)          II                                         
                                      4. SC                      10650 ( 8.9%)          I                                          
                                      5. Undefined                1169 ( 1.0%)                                                     

14   country                          1. PRT                     48590 (40.7%)          IIIIIIII               119390     0        
     [character]                      2. GBR                     12129 (10.2%)          II                     (100.0%)   (0.0%)   
                                      3. FRA                     10415 ( 8.7%)          I                                          
                                      4. ESP                      8568 ( 7.2%)          I                                          
                                      5. DEU                      7287 ( 6.1%)          I                                          
                                      6. ITA                      3766 ( 3.2%)                                                     
                                      7. IRL                      3375 ( 2.8%)                                                     
                                      8. BEL                      2342 ( 2.0%)                                                     
                                      9. BRA                      2224 ( 1.9%)                                                     
                                      10. NLD                     2104 ( 1.8%)                                                     
                                      [ 168 others ]             18590 (15.6%)          III                                        

15   market_segment                   1. Aviation                  237 ( 0.2%)                                 119390     0        
     [character]                      2. Complementary             743 ( 0.6%)                                 (100.0%)   (0.0%)   
                                      3. Corporate                5295 ( 4.4%)                                                     
                                      4. Direct                  12606 (10.6%)          II                                         
                                      5. Groups                  19811 (16.6%)          III                                        
                                      6. Offline TA/TO           24219 (20.3%)          IIII                                       
                                      7. Online TA               56477 (47.3%)          IIIIIIIII                                  
                                      8. Undefined                   2 ( 0.0%)                                                     

16   distribution_channel             1. Corporate                6677 ( 5.6%)          I                      119390     0        
     [character]                      2. Direct                  14645 (12.3%)          II                     (100.0%)   (0.0%)   
                                      3. GDS                       193 ( 0.2%)                                                     
                                      4. TA/TO                   97870 (82.0%)          IIIIIIIIIIIIIIII                           
                                      5. Undefined                   5 ( 0.0%)                                                     

17   is_repeated_guest                Min  : 0                   0 : 115580 (96.8%)     IIIIIIIIIIIIIIIIIII    119390     0        
     [numeric]                        Mean : 0                   1 :   3810 ( 3.2%)                            (100.0%)   (0.0%)   
                                      Max  : 1                                                                                     

18   previous_cancellations           Mean (sd) : 0.1 (0.8)      15 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 26                                        :                                          
                                      IQR (CV) : 0 (9.7)                                :                                          
                                                                                        :                                          

19   previous_bookings_not_canceled   Mean (sd) : 0.1 (1.5)      73 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 72                                        :                                          
                                      IQR (CV) : 0 (10.9)                               :                                          
                                                                                        :                                          

20   reserved_room_type               1. A                       85994 (72.0%)          IIIIIIIIIIIIII         119390     0        
     [character]                      2. B                        1118 ( 0.9%)                                 (100.0%)   (0.0%)   
                                      3. C                         932 ( 0.8%)                                                     
                                      4. D                       19201 (16.1%)          III                                        
                                      5. E                        6535 ( 5.5%)          I                                          
                                      6. F                        2897 ( 2.4%)                                                     
                                      7. G                        2094 ( 1.8%)                                                     
                                      8. H                         601 ( 0.5%)                                                     
                                      9. L                           6 ( 0.0%)                                                     
                                      10. P                         12 ( 0.0%)                                                     

21   assigned_room_type               1. A                       74053 (62.0%)          IIIIIIIIIIII           119390     0        
     [character]                      2. D                       25322 (21.2%)          IIII                   (100.0%)   (0.0%)   
                                      3. E                        7806 ( 6.5%)          I                                          
                                      4. F                        3751 ( 3.1%)                                                     
                                      5. G                        2553 ( 2.1%)                                                     
                                      6. C                        2375 ( 2.0%)                                                     
                                      7. B                        2163 ( 1.8%)                                                     
                                      8. H                         712 ( 0.6%)                                                     
                                      9. I                         363 ( 0.3%)                                                     
                                      10. K                        279 ( 0.2%)                                                     
                                      [ 2 others ]                  13 ( 0.0%)                                                     

22   booking_changes                  Mean (sd) : 0.2 (0.7)      21 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 21                                        :                                          
                                      IQR (CV) : 0 (2.9)                                :                                          
                                                                                        :                                          

23   deposit_type                     1. No Deposit              104641 (87.6%)         IIIIIIIIIIIIIIIII      119390     0        
     [character]                      2. Non Refund               14587 (12.2%)         II                     (100.0%)   (0.0%)   
                                      3. Refundable                 162 ( 0.1%)                                                    

24   agent                            1. 9                       31961 (26.8%)          IIIII                  119390     0        
     [character]                      2. NULL                    16340 (13.7%)          II                     (100.0%)   (0.0%)   
                                      3. 240                     13922 (11.7%)          II                                         
                                      4. 1                        7191 ( 6.0%)          I                                          
                                      5. 14                       3640 ( 3.0%)                                                     
                                      6. 7                        3539 ( 3.0%)                                                     
                                      7. 6                        3290 ( 2.8%)                                                     
                                      8. 250                      2870 ( 2.4%)                                                     
                                      9. 241                      1721 ( 1.4%)                                                     
                                      10. 28                      1666 ( 1.4%)                                                     
                                      [ 324 others ]             33250 (27.8%)          IIIII                                      

25   company                          1. NULL                    112593 (94.3%)         IIIIIIIIIIIIIIIIII     119390     0        
     [character]                      2. 40                         927 ( 0.8%)                                (100.0%)   (0.0%)   
                                      3. 223                        784 ( 0.7%)                                                    
                                      4. 67                         267 ( 0.2%)                                                    
                                      5. 45                         250 ( 0.2%)                                                    
                                      6. 153                        215 ( 0.2%)                                                    
                                      7. 174                        149 ( 0.1%)                                                    
                                      8. 219                        141 ( 0.1%)                                                    
                                      9. 281                        138 ( 0.1%)                                                    
                                      10. 154                       133 ( 0.1%)                                                    
                                      [ 343 others ]               3793 ( 3.2%)                                                    

26   days_in_waiting_list             Mean (sd) : 2.3 (17.6)     128 distinct values    :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 391                                       :                                          
                                      IQR (CV) : 0 (7.6)                                :                                          
                                                                                        :                                          

27   customer_type                    1. Contract                 4076 ( 3.4%)                                 119390     0        
     [character]                      2. Group                     577 ( 0.5%)                                 (100.0%)   (0.0%)   
                                      3. Transient               89613 (75.1%)          IIIIIIIIIIIIIII                            
                                      4. Transient-Party         25124 (21.0%)          IIII                                       

28   adr                              Mean (sd) : 101.8 (50.5)   8879 distinct values   :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      -6.4 < 94.6 < 5400                                :                                          
                                      IQR (CV) : 56.7 (0.5)                             :                                          
                                                                                        :                                          

29   required_car_parking_spaces      Mean (sd) : 0.1 (0.2)      0 : 111974 (93.8%)     IIIIIIIIIIIIIIIIII     119390     0        
     [numeric]                        min < med < max:           1 :   7383 ( 6.2%)     I                      (100.0%)   (0.0%)   
                                      0 < 0 < 8                  2 :     28 ( 0.0%)                                                
                                      IQR (CV) : 0 (3.9)         3 :      3 ( 0.0%)                                                
                                                                 8 :      2 ( 0.0%)                                                

30   total_of_special_requests        Mean (sd) : 0.6 (0.8)      0 : 70318 (58.9%)      IIIIIIIIIII            119390     0        
     [numeric]                        min < med < max:           1 : 33226 (27.8%)      IIIII                  (100.0%)   (0.0%)   
                                      0 < 0 < 5                  2 : 12969 (10.9%)      II                                         
                                      IQR (CV) : 1 (1.4)         3 :  2497 ( 2.1%)                                                 
                                                                 4 :   340 ( 0.3%)                                                 
                                                                 5 :    40 ( 0.0%)                                                 

31   reservation_status               1. Canceled                43017 (36.0%)          IIIIIII                119390     0        
     [character]                      2. Check-Out               75166 (63.0%)          IIIIIIIIIIII           (100.0%)   (0.0%)   
                                      3. No-Show                  1207 ( 1.0%)                                                     

32   reservation_status_date          min : 2014-10-17           926 distinct values            . : : : :      119390     0        
     [Date]                           med : 2016-08-07                                        : : : : : : .    (100.0%)   (0.0%)   
                                      max : 2017-09-14                                      . : : : : : : :                        
                                      range : 2y 10m 28d                                    : : : : : : : :                        
                                                                                        .   : : : : : : : :                        
-----------------------------------------------------------------------------------------------------------------------------------

Briefly describe the data

This dataset describes bookings at two hotels, a city hotel and a resort hotel. It has 119,390 observations (one per booking) and 32 variables (hotel, arrival_date_year, adults, children, babies, deposit_type, assigned_room_type, etc.). The only values flagged as missing are four in the children column, although agent and company contain the string "NULL" for 16,340 and 112,593 bookings respectively, which dfSummary counts as valid. Arrival dates are recorded by year, month, week number, and day of the month for 2015 to 2017. Based on the summary output, August has the most bookings with 13,877 observations (11.6% of the data), followed by July (12,661 observations) and May (11,791 observations). This is plausible, since late spring and summer are when many people take vacations.
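
The "NULL" placeholders mentioned in the summary can be counted and, if desired, converted to real missing values. The following is a minimal sketch on a small made-up tibble (`toy` and its values are illustrative, not taken from the hotel data), and it assumes that "NULL" really does mean "no agent/company recorded":

```r
library(dplyr)

# Toy stand-in for hotel_orig; the values are invented for illustration
toy <- tibble(
  agent    = c("9", "NULL", "240", "NULL"),
  company  = c("NULL", "NULL", "40", "NULL"),
  children = c(0, 1, NA, 0)
)

# Count literal "NULL" strings per column
null_counts <- toy %>%
  summarise(across(everything(), ~ sum(.x == "NULL", na.rm = TRUE)))

# Count true NA values per column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Assumption: "NULL" should be treated as missing in character columns
toy_clean <- toy %>%
  mutate(across(where(is.character), ~ na_if(.x, "NULL")))

null_counts
na_counts
toy_clean
```

An alternative would be to declare the placeholder at read time through the `na` argument of `read_csv()`, for example `na = c("", "NA", "NULL")`.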

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

This data is mostly tidy. The only change needed is to combine arrival_date_year, arrival_date_month, and arrival_date_day_of_month into a single arrival_date column, similar in form to reservation_status_date. As a sanity check, unite() drops the three input columns by default, so the result should keep all 119,390 rows and have 30 columns instead of 32.

hotel <- hotel_orig %>% 
  # combine the date columns into one column
  unite(arrival_date_year, arrival_date_month, arrival_date_day_of_month,
        col = arrival_date, sep = "-")
# view new hotel data
hotel
# A tibble: 119,390 × 30
   hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ stays…⁵ stays…⁶ adults child…⁷ babies
   <chr>     <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
 1 Resort…       0     342 2015-J…      27       0       0      2       0      0
 2 Resort…       0     737 2015-J…      27       0       0      2       0      0
 3 Resort…       0       7 2015-J…      27       0       1      1       0      0
 4 Resort…       0      13 2015-J…      27       0       1      1       0      0
 5 Resort…       0      14 2015-J…      27       0       2      2       0      0
 6 Resort…       0      14 2015-J…      27       0       2      2       0      0
 7 Resort…       0       0 2015-J…      27       0       2      2       0      0
 8 Resort…       0       9 2015-J…      27       0       2      2       0      0
 9 Resort…       1      85 2015-J…      27       0       3      2       0      0
10 Resort…       1      75 2015-J…      27       0       3      2       0      0
# … with 119,380 more rows, 20 more variables: meal <chr>, country <chr>,
#   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …
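
The expected dimensions after uniting can also be asserted in code rather than read off the printout. This is a small sketch on a made-up tibble (`toy` is hypothetical and has far fewer columns than the hotel data); the same check would apply to `hotel_orig` and `hotel`:

```r
library(dplyr)
library(tidyr)

# Toy stand-in with the three date-part columns plus one other column
toy <- tibble(
  hotel                     = c("Resort Hotel", "City Hotel"),
  arrival_date_year         = c(2015, 2016),
  arrival_date_month        = c("July", "August"),
  arrival_date_day_of_month = c(1, 15)
)

# Three columns collapse into one, so the column count drops by two
expected_cols <- ncol(toy) - 3 + 1

toy_united <- toy %>%
  unite("arrival_date",
        arrival_date_year, arrival_date_month, arrival_date_day_of_month,
        sep = "-")

# Fail loudly if the shape is not what was anticipated
stopifnot(
  ncol(toy_united) == expected_cols,
  nrow(toy_united) == nrow(toy)
)

toy_united
```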

Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

Several categorical variables (hotel, meal, reserved_room_type, etc.) need to be converted into factors, and arrival_date needs to be converted from a character string into a date.

hotel <- hotel %>% 
  mutate(
    # convert hotel, is_canceled, etc., columns to factors
    across(c(hotel, is_canceled, meal:is_repeated_guest, reserved_room_type:assigned_room_type, deposit_type:company, 
             customer_type, reservation_status), factor),
    # convert arrival_date into date format
    arrival_date = ymd(arrival_date)
  )
# view new hotel dataset
hotel
# A tibble: 119,390 × 30
   hotel       is_ca…¹ lead_…² arrival_…³ arriv…⁴ stays…⁵ stays…⁶ adults child…⁷
   <fct>       <fct>     <dbl> <date>       <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
 1 Resort Hot… 0           342 2015-07-01      27       0       0      2       0
 2 Resort Hot… 0           737 2015-07-01      27       0       0      2       0
 3 Resort Hot… 0             7 2015-07-01      27       0       1      1       0
 4 Resort Hot… 0            13 2015-07-01      27       0       1      1       0
 5 Resort Hot… 0            14 2015-07-01      27       0       2      2       0
 6 Resort Hot… 0            14 2015-07-01      27       0       2      2       0
 7 Resort Hot… 0             0 2015-07-01      27       0       2      2       0
 8 Resort Hot… 0             9 2015-07-01      27       0       2      2       0
 9 Resort Hot… 1            85 2015-07-01      27       0       3      2       0
10 Resort Hot… 1            75 2015-07-01      27       0       3      2       0
# … with 119,380 more rows, 21 more variables: babies <dbl>, meal <fct>,
#   country <fct>, market_segment <fct>, distribution_channel <fct>,
#   is_repeated_guest <fct>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <fct>,
#   assigned_room_type <fct>, booking_changes <dbl>, deposit_type <fct>,
#   agent <fct>, company <fct>, days_in_waiting_list <dbl>,
#   customer_type <fct>, adr <dbl>, required_car_parking_spaces <dbl>, …
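The challenge asks for sanity checks after tidying and mutating, so here is a minimal sketch of what such a check could look like. It runs on a small hypothetical tibble (`toy`, with made-up rows) rather than the real data, and assumes English month names, as in the hotel bookings file. The same assertions could be pointed at `hotel` instead.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# hypothetical toy rows that mimic the three arrival date columns
toy <- tibble(
  hotel = c("Resort Hotel", "City Hotel", "City Hotel"),
  arrival_date_year = c(2015, 2016, 2017),
  arrival_date_month = c("July", "February", "August"),
  arrival_date_day_of_month = c(1, 29, 31)
)

toy_dates <- toy %>%
  # combine year, month name, and day into one string, e.g. "2015-July-1"
  unite("arrival_date", arrival_date_year, arrival_date_month,
        arrival_date_day_of_month, sep = "-") %>%
  # parse the string into a Date and convert hotel to a factor
  mutate(arrival_date = ymd(arrival_date),
         hotel = factor(hotel))

# sanity checks: stop with an error if any expectation fails
stopifnot(
  is.Date(toy_dates$arrival_date),   # column really is a Date
  !anyNA(toy_dates$arrival_date),    # every string parsed successfully
  nrow(toy_dates) == nrow(toy),      # no rows gained or lost
  nlevels(toy_dates$hotel) == 2      # only the two expected hotel types
)
```

If `ymd()` cannot parse a value it returns `NA` with a warning, so the `anyNA()` check is the one most likely to catch a malformed date.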

Time Dependent Visualization

# create a time series graph of adr vs. arrival_date
ggplot(hotel, aes(arrival_date, adr, color = hotel)) +
  geom_line()

I chose a time series (line) graph because it shows how adr (the average daily rate) changes over arrival_date. The adr appears relatively stable over time, except for a spike between January 2016 and July 2016, which may be an outlier. The adr also doesn’t vary much by hotel type, although there are a few peaks where the resort hotel has a higher adr than the city hotel.
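Because each arrival date has many bookings, a line drawn through the raw rows connects every booking on the same day. One possible refinement, sketched here on hypothetical toy data rather than the real `hotel` tibble, is to summarize adr per hotel per day before plotting. The names `toy`, `daily_adr`, `mean_adr`, and `median_adr` are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# hypothetical toy bookings: several bookings per hotel per day,
# including one made-up extreme value
toy <- tibble(
  hotel = factor(c("City Hotel", "City Hotel", "Resort Hotel",
                   "Resort Hotel", "City Hotel", "Resort Hotel")),
  arrival_date = as.Date(c("2016-03-01", "2016-03-01", "2016-03-01",
                           "2016-03-01", "2016-03-02", "2016-03-02")),
  adr = c(80, 120, 60, 100, 90, 1000)
)

# one row per hotel per day, with the mean and median adr
daily_adr <- toy %>%
  group_by(hotel, arrival_date) %>%
  summarize(mean_adr = mean(adr),
            median_adr = median(adr),
            .groups = "drop")

# line graph of the daily mean, one line per hotel
p <- ggplot(daily_adr, aes(arrival_date, mean_adr, color = hotel)) +
  geom_line() +
  labs(x = "Arrival date", y = "Mean adr", color = "Hotel")
p
```

Plotting `median_adr` instead of `mean_adr` would make the lines less sensitive to a single extreme booking, which may help judge whether the spike is an outlier.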

Visualizing Part-Whole Relationships

# create a bar graph based on deposit_type
ggplot(hotel, aes(deposit_type, fill = hotel)) +
  geom_bar()

# create hotel_new dataset with summarized counts of assigned_room_type by hotel
hotel_new <- hotel %>% 
  # count() returns one ungrouped row per combination, with the count in column n
  count(hotel, assigned_room_type)
# create a treemap based on hotel and assigned_room_type
treemap(hotel_new,
       index = c("hotel", "assigned_room_type"),
       vSize = "n",
       type = "index")

I chose a stacked bar graph of deposit_type because it shows the count of each deposit type and how each count splits between the two hotels. The graph shows that the majority of bookings at both hotels do not require a deposit. Among bookings that do require a deposit, the city hotel has more non-refundable deposits than the resort hotel, and the resort hotel has a small number of refundable deposits.
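Since the two hotels have different numbers of bookings, raw counts can make the part-whole comparison harder to read. As a rough sketch on hypothetical toy data (the rows and the names `toy` and `deposit_share` are made up for illustration), the share of each deposit type within each hotel could be computed and plotted like this:

```r
library(dplyr)
library(ggplot2)

# hypothetical toy bookings: five per hotel
toy <- tibble(
  hotel = c(rep("City Hotel", 5), rep("Resort Hotel", 5)),
  deposit_type = c("No Deposit", "No Deposit", "No Deposit",
                   "Non Refund", "Non Refund",
                   "No Deposit", "No Deposit", "No Deposit",
                   "No Deposit", "Refundable")
)

# share of each deposit type within each hotel
deposit_share <- toy %>%
  count(hotel, deposit_type) %>%
  group_by(hotel) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# position = "fill" rescales each bar to 1, so bars show proportions
p <- ggplot(toy, aes(hotel, fill = deposit_type)) +
  geom_bar(position = "fill") +
  labs(x = "Hotel", y = "Share of bookings", fill = "Deposit type")
p
```

With `position = "fill"` every bar has the same height, so the segments can be compared as parts of a whole regardless of how many bookings each hotel has.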

I chose a treemap of hotel and assigned_room_type because it represents the count of each assigned room type as a block whose area is proportional to that count, which makes the counts easy to compare within and across hotels. This treemap shows that the resort hotel has more room types than the city hotel, but the city hotel has the largest number of “A” assigned room types. The city hotel block is larger than the resort hotel block, meaning more rooms were assigned in the city hotel than in the resort hotel.
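Statements read off a treemap, such as which hotel has more room types or the larger total, can be cross-checked with a small summary table. The sketch below uses a hypothetical table of counts (`toy_counts`, with made-up numbers shaped like `hotel_new`). The summary column names are assumptions made for illustration.

```r
library(dplyr)

# hypothetical counts shaped like hotel_new: one row per hotel and room type
toy_counts <- tibble(
  hotel = c("City Hotel", "City Hotel",
            "Resort Hotel", "Resort Hotel", "Resort Hotel"),
  assigned_room_type = c("A", "D", "A", "D", "E"),
  n = c(50, 20, 15, 10, 5)
)

# per hotel: number of room types, total bookings, most common room type
room_summary <- toy_counts %>%
  group_by(hotel) %>%
  summarize(room_types = n_distinct(assigned_room_type),
            bookings = sum(n),
            top_room = assigned_room_type[which.max(n)],
            .groups = "drop")
room_summary
```

Applied to the real counts, `room_types` would correspond to the number of blocks inside each hotel's area of the treemap and `bookings` to the size of that area.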
