hw2
shreya
hotel_bookings.csv
Reading in Data
Author

Shreya Varma

Published

June 25, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE)

Instructions

For this homework, your goal is to read in a more complicated dataset. Please use the category tag “hw2” as well as a tag for the dataset you choose to use.

  1. Read in a dataset. It’s strongly recommended that you choose a dataset you’re considering using for the final project. If you decide to use one of the datasets we have provided, please use a challenging dataset - check with us if you are not sure.
  2. Clean the data as needed using dplyr and related tidyverse packages. Provide a narrative about the data set (look it up if you aren’t sure what you have got) and the variables in your dataset, including what type of data each variable is. The goal of this step is to communicate in a visually appealing way to non-experts - not to replicate r-code.
  3. Identify potential research questions that your dataset can help answer.

Read in the dataset

I will be using the Hotel Bookings dataset for my Homework. I have imported it using the read_csv() function and will use the glimpse() function the see the columns it has. On a high level it seems to have the information of hotel type and its customer data like arrival departure information, number of people, their booking details, payment type and reservation details. The data has 119,390 rows and 32 columns. To get more insights I will use the summary() function. From the summary we can see that the data captured is from 2015 to 2017.

Code
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
Rows: 119390 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
date  (1): reservation_status_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
hotel_bookings
# A tibble: 119,390 × 32
   hotel        is_canceled lead_time arrival_date_year arrival_date_month
   <chr>              <dbl>     <dbl>             <dbl> <chr>             
 1 Resort Hotel           0       342              2015 July              
 2 Resort Hotel           0       737              2015 July              
 3 Resort Hotel           0         7              2015 July              
 4 Resort Hotel           0        13              2015 July              
 5 Resort Hotel           0        14              2015 July              
 6 Resort Hotel           0        14              2015 July              
 7 Resort Hotel           0         0              2015 July              
 8 Resort Hotel           0         9              2015 July              
 9 Resort Hotel           1        85              2015 July              
10 Resort Hotel           1        75              2015 July              
# ℹ 119,380 more rows
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …
Code
glimpse(hotel_bookings)
Rows: 119,390
Columns: 32
$ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
$ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
$ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
$ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
$ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
$ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
$ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
$ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
$ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
$ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
$ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
$ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
$ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
$ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
$ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
$ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ customer_type                  <chr> "Transient", "Transient", "Transient", …
$ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
$ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
$ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
$ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…
Code
summary(hotel_bookings)
    hotel            is_canceled       lead_time   arrival_date_year
 Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
 Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
 Mode  :character   Median :0.0000   Median : 69   Median :2016     
                    Mean   :0.3704   Mean   :104   Mean   :2016     
                    3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
                    Max.   :1.0000   Max.   :737   Max.   :2017     
                                                                    
 arrival_date_month arrival_date_week_number arrival_date_day_of_month
 Length:119390      Min.   : 1.00            Min.   : 1.0             
 Class :character   1st Qu.:16.00            1st Qu.: 8.0             
 Mode  :character   Median :28.00            Median :16.0             
                    Mean   :27.17            Mean   :15.8             
                    3rd Qu.:38.00            3rd Qu.:23.0             
                    Max.   :53.00            Max.   :31.0             
                                                                      
 stays_in_weekend_nights stays_in_week_nights     adults      
 Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
 1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
 Median : 1.0000         Median : 2.0         Median : 2.000  
 Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
 3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
 Max.   :19.0000         Max.   :50.0         Max.   :55.000  
                                                              
    children           babies              meal             country         
 Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
 1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
 Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
 Mean   : 0.1039   Mean   : 0.007949                                        
 3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
 Max.   :10.0000   Max.   :10.000000                                        
 NA's   :4                                                                  
 market_segment     distribution_channel is_repeated_guest
 Length:119390      Length:119390        Min.   :0.00000  
 Class :character   Class :character     1st Qu.:0.00000  
 Mode  :character   Mode  :character     Median :0.00000  
                                         Mean   :0.03191  
                                         3rd Qu.:0.00000  
                                         Max.   :1.00000  
                                                          
 previous_cancellations previous_bookings_not_canceled reserved_room_type
 Min.   : 0.00000       Min.   : 0.0000                Length:119390     
 1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
 Median : 0.00000       Median : 0.0000                Mode  :character  
 Mean   : 0.08712       Mean   : 0.1371                                  
 3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
 Max.   :26.00000       Max.   :72.0000                                  
                                                                         
 assigned_room_type booking_changes   deposit_type          agent          
 Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
 Class :character   1st Qu.: 0.0000   Class :character   Class :character  
 Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
                    Mean   : 0.2211                                        
                    3rd Qu.: 0.0000                                        
                    Max.   :21.0000                                        
                                                                           
   company          days_in_waiting_list customer_type           adr         
 Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
 Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
 Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
                    Mean   :  2.321                         Mean   : 101.83  
                    3rd Qu.:  0.000                         3rd Qu.: 126.00  
                    Max.   :391.000                         Max.   :5400.00  
                                                                             
 required_car_parking_spaces total_of_special_requests reservation_status
 Min.   :0.00000             Min.   :0.0000            Length:119390     
 1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
 Median :0.00000             Median :0.0000            Mode  :character  
 Mean   :0.06252             Mean   :0.5714                              
 3rd Qu.:0.00000             3rd Qu.:1.0000                              
 Max.   :8.00000             Max.   :5.0000                              
                                                                         
 reservation_status_date
 Min.   :2014-10-17     
 1st Qu.:2016-02-01     
 Median :2016-08-07     
 Mean   :2016-07-30     
 3rd Qu.:2017-02-08     
 Max.   :2017-09-14     
                        

Clean the data

We can now start cleaning the dataset. I will be checking if any columns have duplicate values, missing values, null values and remove irrelavent columns. I will also use data visualization to understand the data better.

  1. Duplicate values: I see that the data has duplicate rows thus I will drop it using the unique() function. Now the new data has 87,392 rows.
Code
duplicate_rows <- hotel_bookings[duplicated(hotel_bookings) | duplicated(hotel_bookings, fromLast = TRUE), ]
duplicate_rows
# A tibble: 40,165 × 32
   hotel        is_canceled lead_time arrival_date_year arrival_date_month
   <chr>              <dbl>     <dbl>             <dbl> <chr>             
 1 Resort Hotel           0        14              2015 July              
 2 Resort Hotel           0        14              2015 July              
 3 Resort Hotel           0        72              2015 July              
 4 Resort Hotel           0        72              2015 July              
 5 Resort Hotel           0        70              2015 July              
 6 Resort Hotel           0        70              2015 July              
 7 Resort Hotel           1         5              2015 July              
 8 Resort Hotel           1         5              2015 July              
 9 Resort Hotel           0         0              2015 July              
10 Resort Hotel           0         0              2015 July              
# ℹ 40,155 more rows
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>, …
Code
hotel_bookings <- unique(hotel_bookings)

glimpse(hotel_bookings)
Rows: 87,396
Columns: 32
$ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
$ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, …
$ lead_time                      <dbl> 342, 737, 7, 13, 14, 0, 9, 85, 75, 23, …
$ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
$ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
$ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, …
$ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "FB…
$ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "PRT…
$ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
$ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
$ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type             <chr> "C", "C", "A", "A", "A", "C", "C", "A",…
$ assigned_room_type             <chr> "C", "C", "C", "A", "A", "C", "C", "A",…
$ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
$ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
$ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
$ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ customer_type                  <chr> "Transient", "Transient", "Transient", …
$ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 107.00…
$ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 3, 1, …
$ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
$ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…
  1. Missing and Null values: We see that there are no missing values in the rows we have
Code
count_missing <- sum(is.na(hotel_bookings))
count_missing
[1] 4
  1. Removing irrelevant data: Since 82k rows in Company column does not contain any data I am dropping the column.
Code
head(hotel_bookings)
# A tibble: 6 × 32
  hotel        is_canceled lead_time arrival_date_year arrival_date_month
  <chr>              <dbl>     <dbl>             <dbl> <chr>             
1 Resort Hotel           0       342              2015 July              
2 Resort Hotel           0       737              2015 July              
3 Resort Hotel           0         7              2015 July              
4 Resort Hotel           0        13              2015 July              
5 Resort Hotel           0        14              2015 July              
6 Resort Hotel           0         0              2015 July              
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, …
Code
table(hotel_bookings$company)

   10   100   101   102   103   104   105   106   107   108   109    11   110 
    1     1     1     1    16     1     5     2     6     9     1     1    48 
  112   113   115   116   118    12   120   122   126   127   130   132   135 
   13    36     4     6     7    13    13     1     1     8     9     1    52 
  137   139    14   140   142   143   144   146   148   149   150   153   154 
    4     3     9     1     1    11    20     3    36     2    18   206   133 
  158   159    16   160   163   165   167   168   169   174   178   179    18 
    2     4     5     1    17     3     7     2    53   121    25    24     1 
  180   183   184   185   186   192   193   195   197    20   200   202   203 
    5    16     1     2    11     4     4    34    13    31     3     1    10 
  204   207   209   210   212   213   215   216   217   218   219    22   220 
   33     9    14     2     1     1     5    21     2    31   131     5     4 
  221   222   223   224   225   227   229   230   232   233   234   237   238 
   27     2   503     3     7    20     1     2     2    95     1     1    30 
  240   242   243   245   246   250   251   253   254   255   257   258   259 
    3    15     2     2     2     2    17     1     7     6     1     1     2 
  260   263   264   268   269   270   271   272   273   274   275   277   278 
    3    14     2    11    33    43     1     3     1    12     3     4     2 
  279    28   280   281   282   284   286   287   288   289    29   290   291 
    3     5    36   119     4     1    18     5     1     2     1    17    12 
  292   293   297   301   302   304   305   307   308   309    31   311   312 
   17     5     7     1     4     2     1    25    15     1    13     2     2 
  313   314   316   317   318   319    32   320   321   323   324   325   329 
    1     1     2     8     1     3     1     1     2    10     9     2    11 
  330   331   332   333   334   337   338    34   341   342   343   346   347 
    4    60     2     4     3    17    12     8     5    18    14     8     1 
  348   349    35   350   351   352   353   355   356   357   358   360   361 
    7     2     1     3     2     1     4    10    10     5     7     9     2 
  362   364   365   366   367   368   369    37   370   371   372   373   376 
    2     5    16     9     9     1     4     1     2    10     3     1     1 
  377   378   379    38   380   382   383   384   385   386   388    39   390 
    2     3     9    22    12     4     5     8     3     1     7     8    13 
  391   392   393   394   395   396   397   398   399    40   400   401   402 
    2     3     1     5     3    12     7     1    11   851     2     1     1 
  403   405   407   408   409   410   411   412   413   415   416   417   418 
    2    77     9    15     9     2     2     1     1     1     1     1    24 
  419    42   420   421   422   423   424   425   426   428   429    43   433 
    1     5     1     8     1     2     8     1     4    13     2    13     1 
  435   436   437   439   442   443   444   445   446   447   448    45   450 
   10     2     6     6     1     4     4     4     1     2     4   238    10 
  451   452   454   455   456   457   458   459    46   460   461   465   466 
    4     3     1     1     2     3     2     5    24     2     1    12     3 
   47   470   477   478   479    48   481   482   483   484   485   486   487 
   62     5    14     2     1     2     1     2     1     2    14     2     1 
  489    49   490   491   492   494   496   497   498   499   501   504   506 
    1     5     5     2     2     4     1     1    34     1     1     9     1 
  507    51   511   512   513   514   515   516   518    52   520   521   523 
    8    80     5     3     2     2     6     1     2     1     1     5     8 
  525   528    53   530   531   534   539    54   541   543    59     6    61 
   11     2     8     4     1     1     2     1     1     2     7     1     2 
   62    64    65    67    68    71    72    73    76    77    78     8    80 
   44     1     1    44    26     1    29     3     1     1    15     1     1 
   81    82    83    84    85    86    88     9    91    92    93    94    96 
   22    14     9     3     2    32    18    36    46    13     3    76     1 
   99  NULL 
   12 82137 
Code
hotel_bookings <- hotel_bookings[, -which(names(hotel_bookings) == "company")]
glimpse(hotel_bookings)
Rows: 87,396
Columns: 31
$ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
$ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, …
$ lead_time                      <dbl> 342, 737, 7, 13, 14, 0, 9, 85, 75, 23, …
$ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
$ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
$ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, …
$ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "FB…
$ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "PRT…
$ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
$ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
$ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type             <chr> "C", "C", "A", "A", "A", "C", "C", "A",…
$ assigned_room_type             <chr> "C", "C", "C", "A", "A", "C", "C", "A",…
$ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
$ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
$ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ customer_type                  <chr> "Transient", "Transient", "Transient", …
$ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 107.00…
$ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 3, 1, …
$ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
$ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…
  1. Data visualization: We see that most of the bookings comes from the distribution channel TA/TO. Thus most people book through travel agents or tour operators. To check which months see most number of tourists I created a frequency bar plot of all months. February and June seem to have the highest number of tourists.
Code
barplot(table(hotel_bookings$distribution_channel))

Code
month_counts <- table(hotel_bookings$arrival_date_month)
barplot(month_counts, names.arg = month.name)

Research Questions

Weekends vs Weekdays: Do most number of tourists book on weekdays or weekends?

Lead time: We can analyse the lead time to check in how advance people do their bookings and if it is dependent on any of the distribution channels.

Market segment: We can check which market segment is most predominant and if the distribution channel depends on the market segment

Length of stay: We can check if the length of stay varies across customer segments or hotel types

Are there differences in booking behavior and preferences between domestic and international guests?