sathvik_hotel_bookings_data_hw3

Hotel bookings Data

sathvik_thogaru
08-18-2021

Importing data

This data set contains a single file which compares various booking information between hotels.

hotel_bookings <- read_csv("../../_data/hotel_bookings.csv")
head(hotel_bookings)
# A tibble: 6 × 32
  hotel        is_canceled lead_time arrival_date_ye… arrival_date_mo…
  <chr>              <dbl>     <dbl>            <dbl> <chr>           
1 Resort Hotel           0       342             2015 July            
2 Resort Hotel           0       737             2015 July            
3 Resort Hotel           0         7             2015 July            
4 Resort Hotel           0        13             2015 July            
5 Resort Hotel           0        14             2015 July            
6 Resort Hotel           0        14             2015 July            
# … with 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>,
#   babies <dbl>, meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …

skim() is an alternative to summary(), quickly providing a broad overview of a data frame. It handles data of all types, dispatching a different set of summary functions based on the types of columns in the data frame.

skim(hotel_bookings)
Table 1: Data summary
Name hotel_bookings
Number of rows 119390
Number of columns 32
_______________________
Column type frequency:
character 13
Date 1
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
hotel 0 1 10 12 0 2 0
arrival_date_month 0 1 3 9 0 12 0
meal 0 1 2 9 0 5 0
country 0 1 2 4 0 178 0
market_segment 0 1 6 13 0 8 0
distribution_channel 0 1 3 9 0 5 0
reserved_room_type 0 1 1 1 0 10 0
assigned_room_type 0 1 1 1 0 12 0
deposit_type 0 1 10 10 0 3 0
agent 0 1 1 4 0 334 0
company 0 1 1 4 0 353 0
customer_type 0 1 5 15 0 4 0
reservation_status 0 1 7 9 0 3 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
reservation_status_date 0 1 2014-10-17 2017-09-14 2016-08-07 926

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
is_canceled 0 1 0.37 0.48 0.00 0.00 0.00 1 1 ▇▁▁▁▅
lead_time 0 1 104.01 106.86 0.00 18.00 69.00 160 737 ▇▂▁▁▁
arrival_date_year 0 1 2016.16 0.71 2015.00 2016.00 2016.00 2017 2017 ▃▁▇▁▆
arrival_date_week_number 0 1 27.17 13.61 1.00 16.00 28.00 38 53 ▅▇▇▇▅
arrival_date_day_of_month 0 1 15.80 8.78 1.00 8.00 16.00 23 31 ▇▇▇▇▆
stays_in_weekend_nights 0 1 0.93 1.00 0.00 0.00 1.00 2 19 ▇▁▁▁▁
stays_in_week_nights 0 1 2.50 1.91 0.00 1.00 2.00 3 50 ▇▁▁▁▁
adults 0 1 1.86 0.58 0.00 2.00 2.00 2 55 ▇▁▁▁▁
children 4 1 0.10 0.40 0.00 0.00 0.00 0 10 ▇▁▁▁▁
babies 0 1 0.01 0.10 0.00 0.00 0.00 0 10 ▇▁▁▁▁
is_repeated_guest 0 1 0.03 0.18 0.00 0.00 0.00 0 1 ▇▁▁▁▁
previous_cancellations 0 1 0.09 0.84 0.00 0.00 0.00 0 26 ▇▁▁▁▁
previous_bookings_not_canceled 0 1 0.14 1.50 0.00 0.00 0.00 0 72 ▇▁▁▁▁
booking_changes 0 1 0.22 0.65 0.00 0.00 0.00 0 21 ▇▁▁▁▁
days_in_waiting_list 0 1 2.32 17.59 0.00 0.00 0.00 0 391 ▇▁▁▁▁
adr 0 1 101.83 50.54 -6.38 69.29 94.58 126 5400 ▇▁▁▁▁
required_car_parking_spaces 0 1 0.06 0.25 0.00 0.00 0.00 0 8 ▇▁▁▁▁
total_of_special_requests 0 1 0.57 0.79 0.00 0.00 0.00 1 5 ▇▁▁▁▁

Data Wrangling

tidying the data

finding the na values in the dataframe( row and column) using which()

which(is.na(hotel_bookings), arr.ind=TRUE)
       row col
[1,] 40601  11
[2,] 40668  11
[3,] 40680  11
[4,] 41161  11
hotel_bookings[c(40601,40668,40680,41161),]
# A tibble: 4 × 32
  hotel      is_canceled lead_time arrival_date_year arrival_date_mon…
  <chr>            <dbl>     <dbl>             <dbl> <chr>            
1 City Hotel           1         2              2015 August           
2 City Hotel           1         1              2015 August           
3 City Hotel           1         1              2015 August           
4 City Hotel           1         8              2015 August           
# … with 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>,
#   babies <dbl>, meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …
hotel_bookings <- filter(hotel_bookings, !is.na(children)) 
sum(is.na(hotel_bookings))
[1] 0

transforming data and visualizing

(hotel_bookings<-hotel_bookings %>% 
  mutate(arrival_month = recode(arrival_date_month,"January"=1,"February"=2,"March"=3,"April"=4,"May"=5,"June"=6,"July"=7,"August"=8,"September"=9,"October"=10,"November"=11,"December"=12)))
# A tibble: 119,386 × 33
   hotel        is_canceled lead_time arrival_date_ye… arrival_date_mo…
   <chr>              <dbl>     <dbl>            <dbl> <chr>           
 1 Resort Hotel           0       342             2015 July            
 2 Resort Hotel           0       737             2015 July            
 3 Resort Hotel           0         7             2015 July            
 4 Resort Hotel           0        13             2015 July            
 5 Resort Hotel           0        14             2015 July            
 6 Resort Hotel           0        14             2015 July            
 7 Resort Hotel           0         0             2015 July            
 8 Resort Hotel           0         9             2015 July            
 9 Resort Hotel           1        85             2015 July            
10 Resort Hotel           1        75             2015 July            
# … with 119,376 more rows, and 28 more variables:
#   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
#   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
#   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …
(hotel_bookings <- hotel_bookings%>%
  mutate(arrival_date = make_date(arrival_date_year,arrival_month,arrival_date_day_of_month)))
# A tibble: 119,386 × 34
   hotel        is_canceled lead_time arrival_date_ye… arrival_date_mo…
   <chr>              <dbl>     <dbl>            <dbl> <chr>           
 1 Resort Hotel           0       342             2015 July            
 2 Resort Hotel           0       737             2015 July            
 3 Resort Hotel           0         7             2015 July            
 4 Resort Hotel           0        13             2015 July            
 5 Resort Hotel           0        14             2015 July            
 6 Resort Hotel           0        14             2015 July            
 7 Resort Hotel           0         0             2015 July            
 8 Resort Hotel           0         9             2015 July            
 9 Resort Hotel           1        85             2015 July            
10 Resort Hotel           1        75             2015 July            
# … with 119,376 more rows, and 29 more variables:
#   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
#   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
#   adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
#   country <chr>, market_segment <chr>, distribution_channel <chr>,
#   is_repeated_guest <dbl>, previous_cancellations <dbl>,
#   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …

looking for type of hotels booked by guests

(hotels_info<-data.frame(table(hotel_bookings$hotel)))
          Var1  Freq
1   City Hotel 79326
2 Resort Hotel 40060
pie(hotels_info$Freq, labels=paste(hotels_info$Var1,sep = "=", hotels_info$Freq), main = "type of hotels booked by guests")

  1. where do most of the guests come from?
(guests_country_details <- hotel_bookings %>% 
   group_by(country) %>% 
   count() %>% 
   ungroup() %>% 
   arrange(desc(n)))
# A tibble: 178 × 2
   country     n
   <chr>   <int>
 1 PRT     48586
 2 GBR     12129
 3 FRA     10415
 4 ESP      8568
 5 DEU      7287
 6 ITA      3766
 7 IRL      3375
 8 BEL      2342
 9 BRA      2224
10 NLD      2104
# … with 168 more rows

most guests by top country’s

ggplot(filter(guests_country_details, n>1500))+
  geom_bar(aes(country, n), stat = "identity")+
  labs(y="number of guests", title = "most guests by country's")+
  coord_flip()

  1. how much do guests pay for a room at each hotel?
city_hotel_data<-filter(hotel_bookings, hotel_bookings$hotel=="City Hotel")
resort_hotel_data<-filter(hotel_bookings, hotel_bookings$hotel=="Resort Hotel")

adr - average daily rate

# city hotel
sort(unique(city_hotel_data$reserved_room_type))
[1] "A" "B" "C" "D" "E" "F" "G" "P"
# resort hotel
sort(unique(resort_hotel_data$reserved_room_type))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "L" "P"

city hotel average guests per room type

(city_hotel_average_guests_per_room_type <- city_hotel_data %>% 
   group_by(reserved_room_type) %>% 
   summarise(guests_per_room_type = mean(adults+children)))
# A tibble: 8 × 2
  reserved_room_type guests_per_room_type
  <chr>                             <dbl>
1 A                                  1.82
2 B                                  2.12
3 C                                  1.64
4 D                                  2.21
5 E                                  2.34
6 F                                  3.62
7 G                                  3.29
8 P                                  0   

city hotel average daily rate per room type

(city_hotel_adr_per_room_type_data <- city_hotel_data %>% 
   group_by(reserved_room_type) %>% 
   summarise(adr_per_room_type = mean(adr)))
# A tibble: 8 × 2
  reserved_room_type adr_per_room_type
  <chr>                          <dbl>
1 A                               96.2
2 B                               90.5
3 C                               85.5
4 D                              131. 
5 E                              157. 
6 F                              189. 
7 G                              202. 
8 P                                0  

resort hotel average guests per room type

(resort_average_guests_per_room_type <- resort_hotel_data %>% 
   group_by(reserved_room_type) %>% 
   summarise(guests_per_room_type = mean(adults+children)))
# A tibble: 10 × 2
   reserved_room_type guests_per_room_type
   <chr>                             <dbl>
 1 A                                  1.80
 2 B                                  2   
 3 C                                  3.34
 4 D                                  2.00
 5 E                                  1.99
 6 F                                  2.05
 7 G                                  3.37
 8 H                                  3.69
 9 L                                  2.17
10 P                                  0   

##resort hotel average daily rate per room type

(resort_hotel_adr_per_room_type_data <- resort_hotel_data %>% 
   group_by(reserved_room_type) %>% 
   summarise(adr_per_room_type = mean(adr)))
# A tibble: 10 × 2
   reserved_room_type adr_per_room_type
   <chr>                          <dbl>
 1 A                               76.2
 2 B                              105. 
 3 C                              161. 
 4 D                              104. 
 5 E                              114. 
 6 F                              133. 
 7 G                              168. 
 8 H                              188. 
 9 L                              125. 
10 P                                0  

visualizing Average daily rate of reserved room type at city hotel and resort hotel

ggplot(city_hotel_adr_per_room_type_data, aes(x=reserved_room_type, y= adr_per_room_type))+
  geom_bar(stat = "identity",fill = "steelblue", width = .5)+
  labs(x="reserved room type", y=" average daily rate per room", title = "Average daily rate of reserved room type at city hotel")+
  theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold"))

ggplot(resort_hotel_adr_per_room_type_data, aes(x=reserved_room_type, y= adr_per_room_type))+
  geom_bar(stat = "identity",fill = "steelblue", width = .5)+
  labs(x="reserved room type", y=" average daily rate per room", title = "Average daily rate of reserved room type at resort hotel")+
  theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold"))

  1. how does the price per night varry in the year in each hotel?

city hotel

city_room_prices_monthly <-select(city_hotel_data, arrival_date_month, adr)
(city_room_prices_monthly <- city_room_prices_monthly %>%
    group_by(arrival_date_month) %>%
    summarise(mean_room_prices = mean(adr)) %>% 
    ungroup() %>% 
  arrange(desc(mean_room_prices)))
# A tibble: 12 × 2
   arrival_date_month mean_room_prices
   <chr>                         <dbl>
 1 May                           122. 
 2 June                          119. 
 3 August                        115. 
 4 April                         111. 
 5 July                          111. 
 6 September                     110. 
 7 October                       100. 
 8 March                          92.6
 9 December                       88.8
10 November                       88.1
11 February                       85.1
12 January                        82.6
plot <- ggplot(city_room_prices_monthly)+
  geom_bar(aes(arrival_date_month, mean_room_prices),stat = "identity")+
  labs(x="arrival_date_month", y=" mean_room_prices", title = "mean room prices over the year at city hotel")+
  theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold"))
  

plot +
  scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"))

resort hotel

resort_room_prices_monthly <-select(resort_hotel_data, arrival_date_month, adr)
(resort_room_prices_monthly <- resort_room_prices_monthly %>%
    group_by(arrival_date_month) %>%
    summarise(mean_room_prices = mean(adr)) %>% 
    ungroup() %>% 
  arrange(desc(mean_room_prices)))
# A tibble: 12 × 2
   arrival_date_month mean_room_prices
   <chr>                         <dbl>
 1 August                        187. 
 2 July                          155. 
 3 June                          110. 
 4 September                      93.3
 5 May                            78.8
 6 April                          77.8
 7 December                       69.0
 8 October                        62.1
 9 March                          57.5
10 February                       55.2
11 January                        49.5
12 November                       48.3
plot <- ggplot(resort_room_prices_monthly)+
  geom_bar(aes(arrival_date_month, mean_room_prices),stat = "identity")+
  labs(x="arrival_date_month", y=" mean_room_prices", title = "mean room prices over the year at resort hotel")+
  theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold"))
  

plot +
  scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"))

  1. which months are the busiest?

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

sathvik_thogaru (2021, Aug. 18). DACSS 601 August 2021: sathvik_hotel_bookings_data_hw3. Retrieved from https://mrolfe.github.io/DACSS601August2021/posts/2021-08-18-sathvikhotelbookingsdatahw3/

BibTeX citation

@misc{sathvik_thogaru2021sathvik_hotel_bookings_data_hw3,
  author = {sathvik_thogaru, },
  title = {DACSS 601 August 2021: sathvik_hotel_bookings_data_hw3},
  url = {https://mrolfe.github.io/DACSS601August2021/posts/2021-08-18-sathvikhotelbookingsdatahw3/},
  year = {2021}
}