Hotel bookings Data
This data set contains a single file which compares various booking information between hotels.
# A tibble: 6 × 32
hotel is_canceled lead_time arrival_date_ye… arrival_date_mo…
<chr> <dbl> <dbl> <dbl> <chr>
1 Resort Hotel 0 342 2015 July
2 Resort Hotel 0 737 2015 July
3 Resort Hotel 0 7 2015 July
4 Resort Hotel 0 13 2015 July
5 Resort Hotel 0 14 2015 July
6 Resort Hotel 0 14 2015 July
# … with 27 more variables: arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
# stays_in_week_nights <dbl>, adults <dbl>, children <dbl>,
# babies <dbl>, meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …
skim() is an alternative to summary(), quickly providing a broad overview of a data frame. It handles data of all types, dispatching a different set of summary functions based on the types of columns in the data frame.
skim(hotel_bookings)
Name | hotel_bookings |
Number of rows | 119390 |
Number of columns | 32 |
_______________________ | |
Column type frequency: | |
character | 13 |
Date | 1 |
numeric | 18 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
hotel | 0 | 1 | 10 | 12 | 0 | 2 | 0 |
arrival_date_month | 0 | 1 | 3 | 9 | 0 | 12 | 0 |
meal | 0 | 1 | 2 | 9 | 0 | 5 | 0 |
country | 0 | 1 | 2 | 4 | 0 | 178 | 0 |
market_segment | 0 | 1 | 6 | 13 | 0 | 8 | 0 |
distribution_channel | 0 | 1 | 3 | 9 | 0 | 5 | 0 |
reserved_room_type | 0 | 1 | 1 | 1 | 0 | 10 | 0 |
assigned_room_type | 0 | 1 | 1 | 1 | 0 | 12 | 0 |
deposit_type | 0 | 1 | 10 | 10 | 0 | 3 | 0 |
agent | 0 | 1 | 1 | 4 | 0 | 334 | 0 |
company | 0 | 1 | 1 | 4 | 0 | 353 | 0 |
customer_type | 0 | 1 | 5 | 15 | 0 | 4 | 0 |
reservation_status | 0 | 1 | 7 | 9 | 0 | 3 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
reservation_status_date | 0 | 1 | 2014-10-17 | 2017-09-14 | 2016-08-07 | 926 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
is_canceled | 0 | 1 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1 | 1 | ▇▁▁▁▅ |
lead_time | 0 | 1 | 104.01 | 106.86 | 0.00 | 18.00 | 69.00 | 160 | 737 | ▇▂▁▁▁ |
arrival_date_year | 0 | 1 | 2016.16 | 0.71 | 2015.00 | 2016.00 | 2016.00 | 2017 | 2017 | ▃▁▇▁▆ |
arrival_date_week_number | 0 | 1 | 27.17 | 13.61 | 1.00 | 16.00 | 28.00 | 38 | 53 | ▅▇▇▇▅ |
arrival_date_day_of_month | 0 | 1 | 15.80 | 8.78 | 1.00 | 8.00 | 16.00 | 23 | 31 | ▇▇▇▇▆ |
stays_in_weekend_nights | 0 | 1 | 0.93 | 1.00 | 0.00 | 0.00 | 1.00 | 2 | 19 | ▇▁▁▁▁ |
stays_in_week_nights | 0 | 1 | 2.50 | 1.91 | 0.00 | 1.00 | 2.00 | 3 | 50 | ▇▁▁▁▁ |
adults | 0 | 1 | 1.86 | 0.58 | 0.00 | 2.00 | 2.00 | 2 | 55 | ▇▁▁▁▁ |
children | 4 | 1 | 0.10 | 0.40 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
babies | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
is_repeated_guest | 0 | 1 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0 | 1 | ▇▁▁▁▁ |
previous_cancellations | 0 | 1 | 0.09 | 0.84 | 0.00 | 0.00 | 0.00 | 0 | 26 | ▇▁▁▁▁ |
previous_bookings_not_canceled | 0 | 1 | 0.14 | 1.50 | 0.00 | 0.00 | 0.00 | 0 | 72 | ▇▁▁▁▁ |
booking_changes | 0 | 1 | 0.22 | 0.65 | 0.00 | 0.00 | 0.00 | 0 | 21 | ▇▁▁▁▁ |
days_in_waiting_list | 0 | 1 | 2.32 | 17.59 | 0.00 | 0.00 | 0.00 | 0 | 391 | ▇▁▁▁▁ |
adr | 0 | 1 | 101.83 | 50.54 | -6.38 | 69.29 | 94.58 | 126 | 5400 | ▇▁▁▁▁ |
required_car_parking_spaces | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0 | 8 | ▇▁▁▁▁ |
total_of_special_requests | 0 | 1 | 0.57 | 0.79 | 0.00 | 0.00 | 0.00 | 1 | 5 | ▇▁▁▁▁ |
finding the na values in the dataframe( row and column) using which()
row col
[1,] 40601 11
[2,] 40668 11
[3,] 40680 11
[4,] 41161 11
hotel_bookings[c(40601,40668,40680,41161),]
# A tibble: 4 × 32
hotel is_canceled lead_time arrival_date_year arrival_date_mon…
<chr> <dbl> <dbl> <dbl> <chr>
1 City Hotel 1 2 2015 August
2 City Hotel 1 1 2015 August
3 City Hotel 1 1 2015 August
4 City Hotel 1 8 2015 August
# … with 27 more variables: arrival_date_week_number <dbl>,
# arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
# stays_in_week_nights <dbl>, adults <dbl>, children <dbl>,
# babies <dbl>, meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …
(hotel_bookings<-hotel_bookings %>%
mutate(arrival_month = recode(arrival_date_month,"January"=1,"February"=2,"March"=3,"April"=4,"May"=5,"June"=6,"July"=7,"August"=8,"September"=9,"October"=10,"November"=11,"December"=12)))
# A tibble: 119,386 × 33
hotel is_canceled lead_time arrival_date_ye… arrival_date_mo…
<chr> <dbl> <dbl> <dbl> <chr>
1 Resort Hotel 0 342 2015 July
2 Resort Hotel 0 737 2015 July
3 Resort Hotel 0 7 2015 July
4 Resort Hotel 0 13 2015 July
5 Resort Hotel 0 14 2015 July
6 Resort Hotel 0 14 2015 July
7 Resort Hotel 0 0 2015 July
8 Resort Hotel 0 9 2015 July
9 Resort Hotel 1 85 2015 July
10 Resort Hotel 1 75 2015 July
# … with 119,376 more rows, and 28 more variables:
# arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
# stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
# adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …
(hotel_bookings <- hotel_bookings%>%
mutate(arrival_date = make_date(arrival_date_year,arrival_month,arrival_date_day_of_month)))
# A tibble: 119,386 × 34
hotel is_canceled lead_time arrival_date_ye… arrival_date_mo…
<chr> <dbl> <dbl> <dbl> <chr>
1 Resort Hotel 0 342 2015 July
2 Resort Hotel 0 737 2015 July
3 Resort Hotel 0 7 2015 July
4 Resort Hotel 0 13 2015 July
5 Resort Hotel 0 14 2015 July
6 Resort Hotel 0 14 2015 July
7 Resort Hotel 0 0 2015 July
8 Resort Hotel 0 9 2015 July
9 Resort Hotel 1 85 2015 July
10 Resort Hotel 1 75 2015 July
# … with 119,376 more rows, and 29 more variables:
# arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
# stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>,
# adults <dbl>, children <dbl>, babies <dbl>, meal <chr>,
# country <chr>, market_segment <chr>, distribution_channel <chr>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <chr>, …
looking for type of hotels booked by guests
(hotels_info<-data.frame(table(hotel_bookings$hotel)))
Var1 Freq
1 City Hotel 79326
2 Resort Hotel 40060
pie(hotels_info$Freq, labels=paste(hotels_info$Var1,sep = "=", hotels_info$Freq), main = "type of hotels booked by guests")
(guests_country_details <- hotel_bookings %>%
group_by(country) %>%
count() %>%
ungroup() %>%
arrange(desc(n)))
# A tibble: 178 × 2
country n
<chr> <int>
1 PRT 48586
2 GBR 12129
3 FRA 10415
4 ESP 8568
5 DEU 7287
6 ITA 3766
7 IRL 3375
8 BEL 2342
9 BRA 2224
10 NLD 2104
# … with 168 more rows
most guests by top country’s
ggplot(filter(guests_country_details, n>1500))+
geom_bar(aes(country, n), stat = "identity")+
labs(y="number of guests", title = "most guests by country's")+
coord_flip()
adr - average daily rate
[1] "A" "B" "C" "D" "E" "F" "G" "P"
[1] "A" "B" "C" "D" "E" "F" "G" "H" "L" "P"
(city_hotel_average_guests_per_room_type <- city_hotel_data %>%
group_by(reserved_room_type) %>%
summarise(guests_per_room_type = mean(adults+children)))
# A tibble: 8 × 2
reserved_room_type guests_per_room_type
<chr> <dbl>
1 A 1.82
2 B 2.12
3 C 1.64
4 D 2.21
5 E 2.34
6 F 3.62
7 G 3.29
8 P 0
(city_hotel_adr_per_room_type_data <- city_hotel_data %>%
group_by(reserved_room_type) %>%
summarise(adr_per_room_type = mean(adr)))
# A tibble: 8 × 2
reserved_room_type adr_per_room_type
<chr> <dbl>
1 A 96.2
2 B 90.5
3 C 85.5
4 D 131.
5 E 157.
6 F 189.
7 G 202.
8 P 0
(resort_average_guests_per_room_type <- resort_hotel_data %>%
group_by(reserved_room_type) %>%
summarise(guests_per_room_type = mean(adults+children)))
# A tibble: 10 × 2
reserved_room_type guests_per_room_type
<chr> <dbl>
1 A 1.80
2 B 2
3 C 3.34
4 D 2.00
5 E 1.99
6 F 2.05
7 G 3.37
8 H 3.69
9 L 2.17
10 P 0
##resort hotel average daily rate per room type
(resort_hotel_adr_per_room_type_data <- resort_hotel_data %>%
group_by(reserved_room_type) %>%
summarise(adr_per_room_type = mean(adr)))
# A tibble: 10 × 2
reserved_room_type adr_per_room_type
<chr> <dbl>
1 A 76.2
2 B 105.
3 C 161.
4 D 104.
5 E 114.
6 F 133.
7 G 168.
8 H 188.
9 L 125.
10 P 0
ggplot(city_hotel_adr_per_room_type_data, aes(x=reserved_room_type, y= adr_per_room_type))+
geom_bar(stat = "identity",fill = "steelblue", width = .5)+
labs(x="reserved room type", y=" average daily rate per room", title = "Average daily rate of reserved room type at city hotel")+
theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold"))
ggplot(resort_hotel_adr_per_room_type_data, aes(x=reserved_room_type, y= adr_per_room_type))+
geom_bar(stat = "identity",fill = "steelblue", width = .5)+
labs(x="reserved room type", y=" average daily rate per room", title = "Average daily rate of reserved room type at resort hotel")+
theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold"))
city_room_prices_monthly <-select(city_hotel_data, arrival_date_month, adr)
(city_room_prices_monthly <- city_room_prices_monthly %>%
group_by(arrival_date_month) %>%
summarise(mean_room_prices = mean(adr)) %>%
ungroup() %>%
arrange(desc(mean_room_prices)))
# A tibble: 12 × 2
arrival_date_month mean_room_prices
<chr> <dbl>
1 May 122.
2 June 119.
3 August 115.
4 April 111.
5 July 111.
6 September 110.
7 October 100.
8 March 92.6
9 December 88.8
10 November 88.1
11 February 85.1
12 January 82.6
plot <- ggplot(city_room_prices_monthly)+
geom_bar(aes(arrival_date_month, mean_room_prices),stat = "identity")+
labs(x="arrival_date_month", y=" mean_room_prices", title = "mean room prices over the year at city hotel")+
theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold"))
plot +
scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"))
resort_room_prices_monthly <-select(resort_hotel_data, arrival_date_month, adr)
(resort_room_prices_monthly <- resort_room_prices_monthly %>%
group_by(arrival_date_month) %>%
summarise(mean_room_prices = mean(adr)) %>%
ungroup() %>%
arrange(desc(mean_room_prices)))
# A tibble: 12 × 2
arrival_date_month mean_room_prices
<chr> <dbl>
1 August 187.
2 July 155.
3 June 110.
4 September 93.3
5 May 78.8
6 April 77.8
7 December 69.0
8 October 62.1
9 March 57.5
10 February 55.2
11 January 49.5
12 November 48.3
plot <- ggplot(resort_room_prices_monthly)+
geom_bar(aes(arrival_date_month, mean_room_prices),stat = "identity")+
labs(x="arrival_date_month", y=" mean_room_prices", title = "mean room prices over the year at resort hotel")+
theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold"))
plot +
scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"))
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
sathvik_thogaru (2021, Aug. 18). DACSS 601 August 2021: sathvik_hotel_bookings_data_hw3. Retrieved from https://mrolfe.github.io/DACSS601August2021/posts/2021-08-18-sathvikhotelbookingsdatahw3/
BibTeX citation
@misc{sathvik_thogaru2021sathvik_hotel_bookings_data_hw3, author = {sathvik_thogaru, }, title = {DACSS 601 August 2021: sathvik_hotel_bookings_data_hw3}, url = {https://mrolfe.github.io/DACSS601August2021/posts/2021-08-18-sathvikhotelbookingsdatahw3/}, year = {2021} }