DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Homework - 2

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Read Data
  • Describe Data
  • Analysis
  • My Potential research questions

Homework - 2

hw2
hotel_bookings
Reading In data
Author

Neeharika Karanam

Published

December 6, 2022

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'tibble' was built under R version 4.2.2
Warning: package 'stringr' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(psych)

Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read Data

I would like to read in the hotel bookings dataset.

#Read the data
hotel.bookings <- read_csv("_data/hotel_bookings.csv")

Describe Data

Now I would like describe the data make some transformations to display the dataset.

This gives the dimesions of the dataset.

dim(hotel.bookings)
[1] 119390     32

This gives the column names of the dataset.

colnames(hotel.bookings)
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       

This changes the name of the column name

colnames(hotel.bookings)[28] <- "average_daily_rate"
hotel.bookings
# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, average_daily_rate <dbl>, …

This gives the character types of all the columns in the dataset.

table(sapply(hotel.bookings, function(x) typeof(x)))

character    double 
       13        19 
sapply(
  hotel.bookings, function(x) n_distinct(x)
  )
                         hotel                    is_canceled 
                             2                              2 
                     lead_time              arrival_date_year 
                           479                              3 
            arrival_date_month       arrival_date_week_number 
                            12                             53 
     arrival_date_day_of_month        stays_in_weekend_nights 
                            31                             17 
          stays_in_week_nights                         adults 
                            35                             14 
                      children                         babies 
                             6                              5 
                          meal                        country 
                             5                            178 
                market_segment           distribution_channel 
                             8                              5 
             is_repeated_guest         previous_cancellations 
                             2                             15 
previous_bookings_not_canceled             reserved_room_type 
                            73                             10 
            assigned_room_type                booking_changes 
                            12                             21 
                  deposit_type                          agent 
                             3                            334 
                       company           days_in_waiting_list 
                           353                            128 
                 customer_type             average_daily_rate 
                             4                           8879 
   required_car_parking_spaces      total_of_special_requests 
                             5                              6 
            reservation_status        reservation_status_date 
                             3                            926 

This gives the unique values of the dataset.

unique(hotel.bookings$hotel)
[1] "Resort Hotel" "City Hotel"  
unique(hotel.bookings$arrival_date_year)
[1] 2015 2016 2017
unique(hotel.bookings$reservation_status)
[1] "Check-Out" "Canceled"  "No-Show"  
unique(hotel.bookings$distribution_channel)
[1] "Direct"    "Corporate" "TA/TO"     "Undefined" "GDS"      
unique(hotel.bookings$customer_type)
[1] "Transient"       "Contract"        "Transient-Party" "Group"          

In the above dataset we can see that there are 119,390 rows and 32 columns. The column names gives us more information about what the data is about and here the data is the reservation data of the hotels. I could also find that there are 14 columns which have character vales and 18 of them are numerical values. There are quite a few columns which are dates in the dataset.

I have also observed that the column names are quite descriptive and abbreviated for easy understandability. I have found that a column which is names ‘adr’ which should actually mean “average daily rate” So, I went ahead and changed the column name to “average_daily_rate”.

In the dataset there are two different types of the hotels namely : Resort hotel and the City Hotel. The dataset consists of the reservations made from 2015 - 2017. The dataset also covers the reservations from all over the world and has almost 178 countries in them. There is data of both the cancelled reservations as well as the completed reservations along with the no-shows as well. Therefore, each row includes a lot of the information about the different reservations like the type of the hotel, country, number of visitors, dates, daily rates, duration of the stay and some other categorical information about the customer as well as the reservation.

In this dataset, each column represents a variable and each row is an booking. Therefore, there is no pivoting is needed.

This helps in mutating the variables from the dataset and it adds two new variables.

hotel.bookings <- mutate(
  hotel.bookings, 
  number_of_guests = adults + children + babies,
  total_stay = stays_in_weekend_nights + stays_in_week_nights)

print(summary(hotel.bookings, 
                varnumbers= FALSE, 
                plain.ascii= FALSE, 
                style= "grid", 
                graph.magnif= 0.80, 
                valid.col= TRUE),
      method= 'render', 
      table.classes= 'table-condensed')
    hotel            is_canceled       lead_time   arrival_date_year
 Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
 Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
 Mode  :character   Median :0.0000   Median : 69   Median :2016     
                    Mean   :0.3704   Mean   :104   Mean   :2016     
                    3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
                    Max.   :1.0000   Max.   :737   Max.   :2017     
                                                                    
 arrival_date_month arrival_date_week_number arrival_date_day_of_month
 Length:119390      Min.   : 1.00            Min.   : 1.0             
 Class :character   1st Qu.:16.00            1st Qu.: 8.0             
 Mode  :character   Median :28.00            Median :16.0             
                    Mean   :27.17            Mean   :15.8             
                    3rd Qu.:38.00            3rd Qu.:23.0             
                    Max.   :53.00            Max.   :31.0             
                                                                      
 stays_in_weekend_nights stays_in_week_nights     adults      
 Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
 1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
 Median : 1.0000         Median : 2.0         Median : 2.000  
 Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
 3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
 Max.   :19.0000         Max.   :50.0         Max.   :55.000  
                                                              
    children           babies              meal             country         
 Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
 1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
 Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
 Mean   : 0.1039   Mean   : 0.007949                                        
 3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
 Max.   :10.0000   Max.   :10.000000                                        
 NA's   :4                                                                  
 market_segment     distribution_channel is_repeated_guest
 Length:119390      Length:119390        Min.   :0.00000  
 Class :character   Class :character     1st Qu.:0.00000  
 Mode  :character   Mode  :character     Median :0.00000  
                                         Mean   :0.03191  
                                         3rd Qu.:0.00000  
                                         Max.   :1.00000  
                                                          
 previous_cancellations previous_bookings_not_canceled reserved_room_type
 Min.   : 0.00000       Min.   : 0.0000                Length:119390     
 1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
 Median : 0.00000       Median : 0.0000                Mode  :character  
 Mean   : 0.08712       Mean   : 0.1371                                  
 3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
 Max.   :26.00000       Max.   :72.0000                                  
                                                                         
 assigned_room_type booking_changes   deposit_type          agent          
 Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
 Class :character   1st Qu.: 0.0000   Class :character   Class :character  
 Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
                    Mean   : 0.2211                                        
                    3rd Qu.: 0.0000                                        
                    Max.   :21.0000                                        
                                                                           
   company          days_in_waiting_list customer_type      average_daily_rate
 Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38   
 Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29   
 Mode  :character   Median :  0.000      Mode  :character   Median :  94.58   
                    Mean   :  2.321                         Mean   : 101.83   
                    3rd Qu.:  0.000                         3rd Qu.: 126.00   
                    Max.   :391.000                         Max.   :5400.00   
                                                                              
 required_car_parking_spaces total_of_special_requests reservation_status
 Min.   :0.00000             Min.   :0.0000            Length:119390     
 1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
 Median :0.00000             Median :0.0000            Mode  :character  
 Mean   :0.06252             Mean   :0.5714                              
 3rd Qu.:0.00000             3rd Qu.:1.0000                              
 Max.   :8.00000             Max.   :5.0000                              
                                                                         
 reservation_status_date number_of_guests   total_stay    
 Min.   :2014-10-17      Min.   : 0.000   Min.   : 0.000  
 1st Qu.:2016-02-01      1st Qu.: 2.000   1st Qu.: 2.000  
 Median :2016-08-07      Median : 2.000   Median : 3.000  
 Mean   :2016-07-30      Mean   : 1.968   Mean   : 3.428  
 3rd Qu.:2017-02-08      3rd Qu.: 2.000   3rd Qu.: 4.000  
 Max.   :2017-09-14      Max.   :55.000   Max.   :69.000  
                         NA's   :4                        

This table contains some descriptive statistics for the numeric variables of the data. For example, 37% of reservations are really canceled. Typically, reservations are made 104 days before the anticipated arrival date. 1.97 individuals are typically reserved for each reservation. A youngster or infant is included in one reservation out of every 10 on average. 3.43 days are the typical length of stay. After the fact changes are made to 22% of reservations. Last but not least, the typical daily rate for hotels is $101.

This checks for the NA values.

colSums(is.na(hotel.bookings))
                         hotel                    is_canceled 
                             0                              0 
                     lead_time              arrival_date_year 
                             0                              0 
            arrival_date_month       arrival_date_week_number 
                             0                              0 
     arrival_date_day_of_month        stays_in_weekend_nights 
                             0                              0 
          stays_in_week_nights                         adults 
                             0                              0 
                      children                         babies 
                             4                              0 
                          meal                        country 
                             0                              0 
                market_segment           distribution_channel 
                             0                              0 
             is_repeated_guest         previous_cancellations 
                             0                              0 
previous_bookings_not_canceled             reserved_room_type 
                             0                              0 
            assigned_room_type                booking_changes 
                             0                              0 
                  deposit_type                          agent 
                             0                              0 
                       company           days_in_waiting_list 
                             0                              0 
                 customer_type             average_daily_rate 
                             0                              0 
   required_car_parking_spaces      total_of_special_requests 
                             0                              0 
            reservation_status        reservation_status_date 
                             0                              0 
              number_of_guests                     total_stay 
                             4                              0 

The only numerical variables that lack values are four. But as we can see from the dataset itself and the summary table above, some “NULL” values are actually strings. The summary table’s Agent and Company variables both have “NULL” values. As a last sanity check, I’ll check each column separately to determine if it has the value “NULL.”

This checks for the columns having any NULL values.

nulls <- sapply(hotel.bookings, function(x) table(grepl("NULL", x)))

for (i in 1:32) {
  if (!is.na(nulls[[i]][2])) {
    print(nulls[i])
    }
}
$country

 FALSE   TRUE 
118902    488 

$agent

 FALSE   TRUE 
103050  16340 

$company

 FALSE   TRUE 
  6797 112593 

We can observe that there are 3 variables, country, agent and company which have “NULL” values. So now let’s find out how much of this data in the dataset is NULL.

round(100*prop.table(table(grepl("NULL", hotel.bookings$country))), 2)

FALSE  TRUE 
99.59  0.41 
round(100*prop.table(table(grepl("NULL", hotel.bookings$agent))), 2)

FALSE  TRUE 
86.31 13.69 
round(100*prop.table(table(grepl("NULL", hotel.bookings$company))), 2)

FALSE  TRUE 
 5.69 94.31 

We can now observe that there is 0.41% of country data, 13.89% of agent data and 94.31% of company data of reservations that are missing.

Analysis

According to the summary table, resort hotels only charge $508 per day compared to a city hotel’s potential daily rate of $5,400. I believe that is the genuine circumstance.

This gives us the average_daily_rate in the descending order.

hotel.bookings %>% 
  arrange(desc(average_daily_rate)) %>% 
  slice_head(n=10) %>% 
  select(hotel, arrival_date_year, 
         country, 
         agent, 
         number_of_guests, 
         total_stay, 
         reservation_status, 
         average_daily_rate)
# A tibble: 10 × 8
   hotel        arrival_date_year country agent number…¹ total…² reser…³ avera…⁴
   <chr>                    <dbl> <chr>   <chr>    <dbl>   <dbl> <chr>     <dbl>
 1 City Hotel                2016 PRT     12           2       1 Cancel…   5400 
 2 City Hotel                2017 ITA     159          1       1 Check-…    510 
 3 Resort Hotel              2015 PRT     NULL         2       1 Check-…    508 
 4 City Hotel                2016 PRT     NULL         4       2 Check-…    452.
 5 Resort Hotel              2017 PRT     314          2      14 Cancel…    450 
 6 Resort Hotel              2017 PRT     250          4       6 Cancel…    437 
 7 Resort Hotel              2017 PRT     250          4       8 Check-…    426.
 8 Resort Hotel              2017 ESP     240          4       5 Check-…    402 
 9 Resort Hotel              2017 MAR     250          5       8 Check-…    397.
10 Resort Hotel              2017 PRT     250          3      10 Cancel…    392 
# … with abbreviated variable names ¹​number_of_guests, ²​total_stay,
#   ³​reservation_status, ⁴​average_daily_rate

This gives us the average_daily_rate in the ascending order.

hotel.bookings %>% 
  arrange(average_daily_rate) %>% 
  slice_head(n=10) %>% 
  select(hotel, arrival_date_year, 
         country, 
         agent, 
         number_of_guests, 
         total_stay, 
         reservation_status, 
         average_daily_rate)
# A tibble: 10 × 8
   hotel        arrival_date_year country agent number…¹ total…² reser…³ avera…⁴
   <chr>                    <dbl> <chr>   <chr>    <dbl>   <dbl> <chr>     <dbl>
 1 Resort Hotel              2017 GBR     273          2      10 Check-…   -6.38
 2 Resort Hotel              2015 PRT     NULL         2       0 Check-…    0   
 3 Resort Hotel              2015 PRT     NULL         2       0 Check-…    0   
 4 Resort Hotel              2015 PRT     NULL         4       1 Check-…    0   
 5 Resort Hotel              2015 PRT     240          2       0 Check-…    0   
 6 Resort Hotel              2015 PRT     250          1       0 Check-…    0   
 7 Resort Hotel              2015 PRT     NULL         2       0 Check-…    0   
 8 Resort Hotel              2015 PRT     240          2       0 Check-…    0   
 9 Resort Hotel              2015 PRT     305          2       2 Cancel…    0   
10 Resort Hotel              2015 PRT     305          1       2 Check-…    0   
# … with abbreviated variable names ¹​number_of_guests, ²​total_stay,
#   ³​reservation_status, ⁴​average_daily_rate

As a result, it would seem that the row with the average daily cost of $5,400 is incorrect. There is also a row that has a negative average daily rate. I’ll take them both.

This will help us in filtering the data based on the average_daily_rate.

hotel.bookings <- hotel.bookings %>% filter(average_daily_rate>=0 & average_daily_rate<=510)

Now let us perform some grouping and filtering on the data to perform our analysis and understand a bit more in-dept.

hotel.bookings %>% 
  select(hotel, average_daily_rate) %>% 
  group_by(hotel) %>% 
  summarise_if(is.numeric, list(min = min, max = max, mean = mean, std_dev = sd, median = median), na.rm = TRUE)
# A tibble: 2 × 6
  hotel          min   max  mean std_dev median
  <chr>        <dbl> <dbl> <dbl>   <dbl>  <dbl>
1 City Hotel       0   510 105.     39.3   99.9
2 Resort Hotel     0   508  95.0    61.4   75  
hotel.bookings %>% 
  select(country) %>% 
  group_by(country) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  head(n=10)
# A tibble: 10 × 2
# Groups:   country [10]
   country     n
   <chr>   <int>
 1 PRT     48589
 2 GBR     12128
 3 FRA     10415
 4 ESP      8568
 5 DEU      7287
 6 ITA      3766
 7 IRL      3375
 8 BEL      2342
 9 BRA      2224
10 NLD      2104
hotel.bookings %>% 
  select(country, is_canceled) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(is_canceled)) %>% 
  head(n=10)
# A tibble: 10 × 2
   country is_canceled
   <chr>         <dbl>
 1 PRT           27518
 2 GBR            2453
 3 ESP            2177
 4 FRA            1934
 5 ITA            1333
 6 DEU            1218
 7 IRL             832
 8 BRA             830
 9 USA             501
10 BEL             474
hotel.bookings %>% 
  filter(country %in% c("PRT", "GBR", "ESP", "FRA", "ITA")) %>% 
  select(country,is_canceled) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  arrange(desc(is_canceled))
# A tibble: 5 × 2
  country is_canceled
  <chr>         <dbl>
1 PRT           0.566
2 ITA           0.354
3 ESP           0.254
4 GBR           0.202
5 FRA           0.186
hotel.bookings %>% 
  filter(reservation_status == "Check-Out") %>% 
  select(country, number_of_guests) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(number_of_guests)) %>% 
  head(n=10)
# A tibble: 10 × 2
   country number_of_guests
   <chr>              <dbl>
 1 PRT                37670
 2 GBR                19256
 3 FRA                17299
 4 ESP                13213
 5 DEU                11659
 6 IRL                 5102
 7 ITA                 4894
 8 BEL                 3855
 9 NLD                 3374
10 USA                 3212
hotel.bookings %>% 
  filter(reservation_status == "Check-Out") %>% 
  select(country, number_of_guests) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(number_of_guests)) %>% 
  head(n=10)
# A tibble: 10 × 2
   country number_of_guests
   <chr>              <dbl>
 1 PRT                37670
 2 GBR                19256
 3 FRA                17299
 4 ESP                13213
 5 DEU                11659
 6 IRL                 5102
 7 ITA                 4894
 8 BEL                 3855
 9 NLD                 3374
10 USA                 3212
hotel.bookings %>% 
  select(country, average_daily_rate) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  arrange(desc(average_daily_rate)) %>% 
  head(n=10)
# A tibble: 10 × 2
   country average_daily_rate
   <chr>                <dbl>
 1 DJI                   273 
 2 AIA                   265 
 3 AND                   203.
 4 UMI                   200 
 5 LAO                   182.
 6 MYT                   178.
 7 NCL                   176.
 8 GEO                   169.
 9 COM                   165.
10 FRO                   155.
table(hotel.bookings$arrival_date_month)

    April    August  December  February   January      July      June     March 
    11089     13877      6780      8068      5929     12661     10939      9792 
      May  November   October September 
    11791      6794     11160     10508 

The average nightly price difference between city hotels and resort hotels is $11. In contrast, resort hotels offer a wider range of prices than do hotels in cities.

The nations with the most reservations are Portugal, Great Britain, France, Spain, Germany, Italy, Ireland, Belgium, Brazil, and the Netherlands. Great Britain, France, and Spain are next on the list. However, as we can see, 56% of Portugal’s hotel reservations are actually canceled. Each country contributes 35% of this percentage, Italy and Spain. Nevertheless, during the course of three years, Portugal has welcomed a total of 37,670 guests.

According to average daily rates, the priciest hotels may be located in Djibouti, Anguilla, Andorra, United States Minor Outlying Islands, Laos, and so on. It seems that hotels are much more expensive in small countries with little tourists.

As observed the busiest times of the year for hotels are in August, July, and May, respectively.

Finally, I wanted to check how many rows have zero daily rate.

hotel.bookings %>% 
  filter(average_daily_rate == 0) %>% 
  count()
# A tibble: 1 × 1
      n
  <int>
1  1959

I see that there are 1959 reservations which have zero daily rate.

hotel.bookings %>% 
  filter(average_daily_rate == 0) %>% 
  group_by(country) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  head()
# A tibble: 6 × 2
# Groups:   country [6]
  country     n
  <chr>   <int>
1 PRT      1550
2 ESP        79
3 GBR        73
4 FRA        55
5 DEU        41
6 NULL       21

The majority of the zero values come from Portugal. Therefore, more investigation is required into the reliability of hotel data for Portugal.

My Potential research questions

  1. Which country has the maximum number of guests staying the hotels?
  2. In the different countries what is the average daily rate of the hotel booking and for which customer type?
  3. How many rows have zero daily rate and which of the country is majorly responsible for that?
  4. Do the room rates change when the duration of the stay increases in different countries?
Source Code
---
title: "Homework - 2"
author: "Neeharika Karanam"
description: "Reading In data"
date: "12/06/2022"
format:
  html:
    toc: true
    code-copy: true
    code-tools: true
categories:
  - hw2
  - hotel_bookings
---

```{r}
library(tidyverse)
library(psych)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

```


## Read Data

I would like to read in the hotel bookings dataset.

```{r}
#Read the data
hotel.bookings <- read_csv("_data/hotel_bookings.csv")

```

## Describe Data

Now I would like describe the data make some transformations to display the dataset.


This gives the dimesions of the dataset.
```{r}
dim(hotel.bookings)
```

This gives the column names of the dataset.
```{r}
colnames(hotel.bookings)
```

This changes the name of the column name
```{r}
colnames(hotel.bookings)[28] <- "average_daily_rate"
```

```{r}
hotel.bookings
```

This gives the character types of all the columns in the dataset.
```{r}
table(sapply(hotel.bookings, function(x) typeof(x)))
```

```{r}
sapply(
  hotel.bookings, function(x) n_distinct(x)
  )
```

This gives the unique values of the dataset.

```{r}
unique(hotel.bookings$hotel)
unique(hotel.bookings$arrival_date_year)
unique(hotel.bookings$reservation_status)
unique(hotel.bookings$distribution_channel)
unique(hotel.bookings$customer_type)
```

In the above dataset we can see that there are 119,390 rows and 32 columns. The column names gives us more information about what the data is about and here the data is the reservation data of the hotels. I could also find that there are 14 columns which have character vales and 18 of them are numerical values. There are quite a few columns which are dates in the dataset.

I have also observed that the column names are quite descriptive and abbreviated for easy understandability. I have found that a column which is names 'adr' which should actually mean "average daily rate" So, I went ahead and changed the column name to "average_daily_rate".

In the dataset there are two different types of the hotels namely : Resort hotel and the City Hotel. The dataset consists of the reservations made from 2015 - 2017. The dataset also covers the reservations from all over the world and has almost 178 countries in them. There is data of both the cancelled reservations as well as the completed reservations along with the no-shows as well. Therefore, each row includes a lot of the information about the different reservations like the type of the hotel, country, number of visitors, dates, daily rates, duration of the stay and some other categorical information about the customer as well as the reservation.

In this dataset, each column represents a variable and each row is an booking. Therefore, there is no pivoting is needed.

This helps in mutating the variables from the dataset and it adds two new variables.

```{r}
hotel.bookings <- mutate(
  hotel.bookings, 
  number_of_guests = adults + children + babies,
  total_stay = stays_in_weekend_nights + stays_in_week_nights)

print(summary(hotel.bookings, 
                varnumbers= FALSE, 
                plain.ascii= FALSE, 
                style= "grid", 
                graph.magnif= 0.80, 
                valid.col= TRUE),
      method= 'render', 
      table.classes= 'table-condensed')

```

This table contains some descriptive statistics for the numeric variables of the data. For example, 37% of reservations are really canceled. Typically, reservations are made 104 days before the anticipated arrival date. 1.97 individuals are typically reserved for each reservation. A youngster or infant is included in one reservation out of every 10 on average. 3.43 days are the typical length of stay. After the fact changes are made to 22% of reservations. Last but not least, the typical daily rate for hotels is $101.


This checks for the NA values.

```{r}
colSums(is.na(hotel.bookings))
```


The only numerical variables that lack values are four. But as we can see from the dataset itself and the summary table above, some "NULL" values are actually strings. The summary table's Agent and Company variables both have "NULL" values. As a last sanity check, I'll check each column separately to determine if it has the value "NULL."

This checks for the columns having any NULL values.

```{r}
nulls <- sapply(hotel.bookings, function(x) table(grepl("NULL", x)))

for (i in 1:32) {
  if (!is.na(nulls[[i]][2])) {
    print(nulls[i])
    }
}
  
```

We can observe that there are 3 variables, `country`, `agent` and `company` which have "NULL" values. So now let's find out how much of this data in the dataset is NULL.

```{r}
round(100*prop.table(table(grepl("NULL", hotel.bookings$country))), 2)
round(100*prop.table(table(grepl("NULL", hotel.bookings$agent))), 2)
round(100*prop.table(table(grepl("NULL", hotel.bookings$company))), 2)
```

We can now observe that there is 0.41% of `country` data, 13.89% of `agent` data and 94.31% of `company` data of reservations that are missing.

## Analysis

According to the summary table, resort hotels only charge $508 per day compared to a city hotel's potential daily rate of $5,400. I believe that is the genuine circumstance.

This gives us the average_daily_rate in the descending order.
```{r}
hotel.bookings %>% 
  arrange(desc(average_daily_rate)) %>% 
  slice_head(n=10) %>% 
  select(hotel, arrival_date_year, 
         country, 
         agent, 
         number_of_guests, 
         total_stay, 
         reservation_status, 
         average_daily_rate)
```
This gives us the average_daily_rate in the ascending order.

```{r}
hotel.bookings %>% 
  arrange(average_daily_rate) %>% 
  slice_head(n=10) %>% 
  select(hotel, arrival_date_year, 
         country, 
         agent, 
         number_of_guests, 
         total_stay, 
         reservation_status, 
         average_daily_rate)

```

As a result, it would seem that the row with the average daily cost of $5,400 is incorrect. There is also a row that has a negative average daily rate. I'll take them both.

This will help us in filtering the data based on the average_daily_rate.
```{r}
hotel.bookings <- hotel.bookings %>% filter(average_daily_rate>=0 & average_daily_rate<=510)
```

Now let us perform some grouping and filtering on the data to perform our analysis and understand a bit more in-dept.

```{r}
hotel.bookings %>% 
  select(hotel, average_daily_rate) %>% 
  group_by(hotel) %>% 
  summarise_if(is.numeric, list(min = min, max = max, mean = mean, std_dev = sd, median = median), na.rm = TRUE)
```

```{r}
hotel.bookings %>% 
  select(country) %>% 
  group_by(country) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  head(n=10)
```

```{r}
hotel.bookings %>% 
  select(country, is_canceled) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(is_canceled)) %>% 
  head(n=10)
```

```{r}
hotel.bookings %>% 
  filter(country %in% c("PRT", "GBR", "ESP", "FRA", "ITA")) %>% 
  select(country,is_canceled) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  arrange(desc(is_canceled))
```

```{r}
hotel.bookings %>% 
  filter(reservation_status == "Check-Out") %>% 
  select(country, number_of_guests) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(number_of_guests)) %>% 
  head(n=10)
```

```{r}
hotel.bookings %>% 
  filter(reservation_status == "Check-Out") %>% 
  select(country, number_of_guests) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, sum, na.rm = TRUE) %>% 
  arrange(desc(number_of_guests)) %>% 
  head(n=10)
```

```{r}
hotel.bookings %>% 
  select(country, average_daily_rate) %>% 
  group_by(country) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  arrange(desc(average_daily_rate)) %>% 
  head(n=10)
```

```{r}
table(hotel.bookings$arrival_date_month)
```

The average nightly price difference between city hotels and resort hotels is $11. In contrast, resort hotels offer a wider range of prices than do hotels in cities.

The nations with the most reservations are Portugal, Great Britain, France, Spain, Germany, Italy, Ireland, Belgium, Brazil, and the Netherlands. Great Britain, France, and Spain are next on the list. However, as we can see, 56% of Portugal's hotel reservations are actually canceled. Each country contributes 35% of this percentage, Italy and Spain. 
Nevertheless, during the course of three years, Portugal has welcomed a total of 37,670 guests.

According to average daily rates, the priciest hotels may be located in Djibouti, Anguilla, Andorra, United States Minor Outlying Islands, Laos, and so on. It seems that hotels are much more expensive in small countries with little tourists.

As observed the busiest times of the year for hotels are in August, July, and May, respectively.

Finally, I wanted to check how many rows have zero daily rate.

```{r}
hotel.bookings %>% 
  filter(average_daily_rate == 0) %>% 
  count()
```

I see that there are 1959 reservations which have zero daily rate.

```{r}
hotel.bookings %>% 
  filter(average_daily_rate == 0) %>% 
  group_by(country) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  head()

```
The majority of the zero values come from Portugal. Therefore, more investigation is required into the reliability of hotel data for Portugal.

## My Potential research questions

1. Which country has the maximum number of guests staying the hotels?
2. In the different countries what is the average daily rate of the hotel booking and for which customer type?
3. How many rows have zero daily rate and which of the country is majorly responsible for that?
4. Do the room rates change when the duration of the stay increases in different countries?