library(tidyverse)
library(ggplot2)
library(summarytools)
library(lubridate)
library(treemap)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 6
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- create at least one graph including time (evolution)
- try to make them “publication” ready (optional)
- Explain why you choose the specific graph type
- Create at least one graph depicting part-whole or flow relationships
- try to make them “publication” ready (optional)
- Explain why you choose the specific graph type
R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- debt ⭐
- fed_rate ⭐⭐
- abc_poll ⭐⭐⭐
- usa_hh ⭐⭐⭐
- hotel_bookings ⭐⭐⭐⭐
- AB_NYC ⭐⭐⭐⭐⭐
# read in hotel data
<- read_csv("_data/hotel_bookings.csv")
hotel_orig # view hotel data
hotel_orig
# A tibble: 119,390 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
# view summary of hotel data
dfSummary(hotel_orig)
Data Frame Summary
hotel_orig
Dimensions: 119390 x 32
Duplicates: 31994
-----------------------------------------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
---- -------------------------------- -------------------------- ---------------------- ---------------------- ---------- ---------
1 hotel 1. City Hotel 79330 (66.4%) IIIIIIIIIIIII 119390 0
[character] 2. Resort Hotel 40060 (33.6%) IIIIII (100.0%) (0.0%)
2 is_canceled Min : 0 0 : 75166 (63.0%) IIIIIIIIIIII 119390 0
[numeric] Mean : 0.4 1 : 44224 (37.0%) IIIIIII (100.0%) (0.0%)
Max : 1
3 lead_time Mean (sd) : 104 (106.9) 479 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 69 < 737 :
IQR (CV) : 142 (1) : : .
: : : . .
4 arrival_date_year Mean (sd) : 2016.2 (0.7) 2015 : 21996 (18.4%) III 119390 0
[numeric] min < med < max: 2016 : 56707 (47.5%) IIIIIIIII (100.0%) (0.0%)
2015 < 2016 < 2017 2017 : 40687 (34.1%) IIIIII
IQR (CV) : 1 (0)
5 arrival_date_month 1. August 13877 (11.6%) II 119390 0
[character] 2. July 12661 (10.6%) II (100.0%) (0.0%)
3. May 11791 ( 9.9%) I
4. October 11160 ( 9.3%) I
5. April 11089 ( 9.3%) I
6. June 10939 ( 9.2%) I
7. September 10508 ( 8.8%) I
8. March 9794 ( 8.2%) I
9. February 8068 ( 6.8%) I
10. November 6794 ( 5.7%) I
[ 2 others ] 12709 (10.6%) II
6 arrival_date_week_number Mean (sd) : 27.2 (13.6) 53 distinct values . : . . . 119390 0
[numeric] min < med < max: . : : : : : : (100.0%) (0.0%)
1 < 28 < 53 . : : : : : : : : :
IQR (CV) : 22 (0.5) : : : : : : : : : :
: : : : : : : : : :
7 arrival_date_day_of_month Mean (sd) : 15.8 (8.8) 31 distinct values : 119390 0
[numeric] min < med < max: : : : . : : . : : (100.0%) (0.0%)
1 < 16 < 31 : : : : : : : : : :
IQR (CV) : 15 (0.6) : : : : : : : : : :
: : : : : : : : : :
8 stays_in_weekend_nights Mean (sd) : 0.9 (1) 17 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 1 < 19 :
IQR (CV) : 2 (1.1) : :
: :
9 stays_in_week_nights Mean (sd) : 2.5 (1.9) 35 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 2 < 50 :
IQR (CV) : 2 (0.8) :
:
10 adults Mean (sd) : 1.9 (0.6) 14 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 2 < 55 :
IQR (CV) : 0 (0.3) :
:
11 children Mean (sd) : 0.1 (0.4) 0 : 110796 (92.8%) IIIIIIIIIIIIIIIIII 119386 4
[numeric] min < med < max: 1 : 4861 ( 4.1%) (100.0%) (0.0%)
0 < 0 < 10 2 : 3652 ( 3.1%)
IQR (CV) : 0 (3.8) 3 : 76 ( 0.1%)
10 : 1 ( 0.0%)
12 babies Mean (sd) : 0 (0.1) 0 : 118473 (99.2%) IIIIIIIIIIIIIIIIIII 119390 0
[numeric] min < med < max: 1 : 900 ( 0.8%) (100.0%) (0.0%)
0 < 0 < 10 2 : 15 ( 0.0%)
IQR (CV) : 0 (12.3) 9 : 1 ( 0.0%)
10 : 1 ( 0.0%)
13 meal 1. BB 92310 (77.3%) IIIIIIIIIIIIIII 119390 0
[character] 2. FB 798 ( 0.7%) (100.0%) (0.0%)
3. HB 14463 (12.1%) II
4. SC 10650 ( 8.9%) I
5. Undefined 1169 ( 1.0%)
14 country 1. PRT 48590 (40.7%) IIIIIIII 119390 0
[character] 2. GBR 12129 (10.2%) II (100.0%) (0.0%)
3. FRA 10415 ( 8.7%) I
4. ESP 8568 ( 7.2%) I
5. DEU 7287 ( 6.1%) I
6. ITA 3766 ( 3.2%)
7. IRL 3375 ( 2.8%)
8. BEL 2342 ( 2.0%)
9. BRA 2224 ( 1.9%)
10. NLD 2104 ( 1.8%)
[ 168 others ] 18590 (15.6%) III
15 market_segment 1. Aviation 237 ( 0.2%) 119390 0
[character] 2. Complementary 743 ( 0.6%) (100.0%) (0.0%)
3. Corporate 5295 ( 4.4%)
4. Direct 12606 (10.6%) II
5. Groups 19811 (16.6%) III
6. Offline TA/TO 24219 (20.3%) IIII
7. Online TA 56477 (47.3%) IIIIIIIII
8. Undefined 2 ( 0.0%)
16 distribution_channel 1. Corporate 6677 ( 5.6%) I 119390 0
[character] 2. Direct 14645 (12.3%) II (100.0%) (0.0%)
3. GDS 193 ( 0.2%)
4. TA/TO 97870 (82.0%) IIIIIIIIIIIIIIII
5. Undefined 5 ( 0.0%)
17 is_repeated_guest Min : 0 0 : 115580 (96.8%) IIIIIIIIIIIIIIIIIII 119390 0
[numeric] Mean : 0 1 : 3810 ( 3.2%) (100.0%) (0.0%)
Max : 1
18 previous_cancellations Mean (sd) : 0.1 (0.8) 15 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 0 < 26 :
IQR (CV) : 0 (9.7) :
:
19 previous_bookings_not_canceled Mean (sd) : 0.1 (1.5) 73 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 0 < 72 :
IQR (CV) : 0 (10.9) :
:
20 reserved_room_type 1. A 85994 (72.0%) IIIIIIIIIIIIII 119390 0
[character] 2. B 1118 ( 0.9%) (100.0%) (0.0%)
3. C 932 ( 0.8%)
4. D 19201 (16.1%) III
5. E 6535 ( 5.5%) I
6. F 2897 ( 2.4%)
7. G 2094 ( 1.8%)
8. H 601 ( 0.5%)
9. L 6 ( 0.0%)
10. P 12 ( 0.0%)
21 assigned_room_type 1. A 74053 (62.0%) IIIIIIIIIIII 119390 0
[character] 2. D 25322 (21.2%) IIII (100.0%) (0.0%)
3. E 7806 ( 6.5%) I
4. F 3751 ( 3.1%)
5. G 2553 ( 2.1%)
6. C 2375 ( 2.0%)
7. B 2163 ( 1.8%)
8. H 712 ( 0.6%)
9. I 363 ( 0.3%)
10. K 279 ( 0.2%)
[ 2 others ] 13 ( 0.0%)
22 booking_changes Mean (sd) : 0.2 (0.7) 21 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 0 < 21 :
IQR (CV) : 0 (2.9) :
:
23 deposit_type 1. No Deposit 104641 (87.6%) IIIIIIIIIIIIIIIII 119390 0
[character] 2. Non Refund 14587 (12.2%) II (100.0%) (0.0%)
3. Refundable 162 ( 0.1%)
24 agent 1. 9 31961 (26.8%) IIIII 119390 0
[character] 2. NULL 16340 (13.7%) II (100.0%) (0.0%)
3. 240 13922 (11.7%) II
4. 1 7191 ( 6.0%) I
5. 14 3640 ( 3.0%)
6. 7 3539 ( 3.0%)
7. 6 3290 ( 2.8%)
8. 250 2870 ( 2.4%)
9. 241 1721 ( 1.4%)
10. 28 1666 ( 1.4%)
[ 324 others ] 33250 (27.8%) IIIII
25 company 1. NULL 112593 (94.3%) IIIIIIIIIIIIIIIIII 119390 0
[character] 2. 40 927 ( 0.8%) (100.0%) (0.0%)
3. 223 784 ( 0.7%)
4. 67 267 ( 0.2%)
5. 45 250 ( 0.2%)
6. 153 215 ( 0.2%)
7. 174 149 ( 0.1%)
8. 219 141 ( 0.1%)
9. 281 138 ( 0.1%)
10. 154 133 ( 0.1%)
[ 343 others ] 3793 ( 3.2%)
26 days_in_waiting_list Mean (sd) : 2.3 (17.6) 128 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 0 < 391 :
IQR (CV) : 0 (7.6) :
:
27 customer_type 1. Contract 4076 ( 3.4%) 119390 0
[character] 2. Group 577 ( 0.5%) (100.0%) (0.0%)
3. Transient 89613 (75.1%) IIIIIIIIIIIIIII
4. Transient-Party 25124 (21.0%) IIII
28 adr Mean (sd) : 101.8 (50.5) 8879 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
-6.4 < 94.6 < 5400 :
IQR (CV) : 56.7 (0.5) :
:
29 required_car_parking_spaces Mean (sd) : 0.1 (0.2) 0 : 111974 (93.8%) IIIIIIIIIIIIIIIIII 119390 0
[numeric] min < med < max: 1 : 7383 ( 6.2%) I (100.0%) (0.0%)
0 < 0 < 8 2 : 28 ( 0.0%)
IQR (CV) : 0 (3.9) 3 : 3 ( 0.0%)
8 : 2 ( 0.0%)
30 total_of_special_requests Mean (sd) : 0.6 (0.8) 0 : 70318 (58.9%) IIIIIIIIIII 119390 0
[numeric] min < med < max: 1 : 33226 (27.8%) IIIII (100.0%) (0.0%)
0 < 0 < 5 2 : 12969 (10.9%) II
IQR (CV) : 1 (1.4) 3 : 2497 ( 2.1%)
4 : 340 ( 0.3%)
5 : 40 ( 0.0%)
31 reservation_status 1. Canceled 43017 (36.0%) IIIIIII 119390 0
[character] 2. Check-Out 75166 (63.0%) IIIIIIIIIIII (100.0%) (0.0%)
3. No-Show 1207 ( 1.0%)
32 reservation_status_date min : 2014-10-17 926 distinct values . : : : : 119390 0
[Date] med : 2016-08-07 : : : : : : . (100.0%) (0.0%)
max : 2017-09-14 . : : : : : : :
range : 2y 10m 28d : : : : : : : :
. : : : : : : : :
-----------------------------------------------------------------------------------------------------------------------------------
Briefly describe the data
This dataset is about the details of hotel bookings of a city hotel and resort hotel, consisting of 119,390 observations (customers/bookings) with 32 variables (hotel
, arrival_date_year
, adults
, children
, babies
, deposit_type
, assigned_room_type
, etc). There are no missing values except four missing in the children
column. This dataset also measures bookings for each day of the month, each week, and each month for years 2015 to 2017. Based on the summary output, the highest number of bookings occurs in August with 13,877 observations (11.6% of the data) with July (12,661 observations) and May (11,791 observations) following behind. This makes sense since May through August are summer months and most people are on vacation during this time.
Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
This data is mostly tidy so the only thing that needs to be done is to combine the arrival_date_year
, arrival_date_month
, and arrival_date_day_of_month
variables into one column as the arrival_date
, similar to the date format of reservation_status_date
.
<- hotel_orig %>%
hotel # combine the date columns into one column
unite(arrival_date_year, arrival_date_month, arrival_date_day_of_month,
col = arrival_date, sep = "-")
# view new hotel data
hotel
# A tibble: 119,390 × 30
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ stays…⁵ stays…⁶ adults child…⁷ babies
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resort… 0 342 2015-J… 27 0 0 2 0 0
2 Resort… 0 737 2015-J… 27 0 0 2 0 0
3 Resort… 0 7 2015-J… 27 0 1 1 0 0
4 Resort… 0 13 2015-J… 27 0 1 1 0 0
5 Resort… 0 14 2015-J… 27 0 2 2 0 0
6 Resort… 0 14 2015-J… 27 0 2 2 0 0
7 Resort… 0 0 2015-J… 27 0 2 2 0 0
8 Resort… 0 9 2015-J… 27 0 2 2 0 0
9 Resort… 1 85 2015-J… 27 0 3 2 0 0
10 Resort… 1 75 2015-J… 27 0 3 2 0 0
# … with 119,380 more rows, 20 more variables: meal <chr>, country <chr>,
# market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>, reservation_status <chr>, …
Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
There are several variables (hotel
, meal
, reserved_room_type
, etc.) that need to be converted into factors and arrival_date
needs to be converted into a date format.
<- hotel %>%
hotel # convert hotel, is_canceled, etc., columns to factors
mutate(across(c(hotel, is_canceled, meal:is_repeated_guest, reserved_room_type:assigned_room_type, deposit_type:company,
customer_type, reservation_status), factor)) # convert arrival_date into date format
$arrival_date <- ymd(hotel$arrival_date)
hotel# view new hotel dataset
hotel
# A tibble: 119,390 × 30
hotel is_ca…¹ lead_…² arrival_…³ arriv…⁴ stays…⁵ stays…⁶ adults child…⁷
<fct> <fct> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resort Hot… 0 342 2015-07-01 27 0 0 2 0
2 Resort Hot… 0 737 2015-07-01 27 0 0 2 0
3 Resort Hot… 0 7 2015-07-01 27 0 1 1 0
4 Resort Hot… 0 13 2015-07-01 27 0 1 1 0
5 Resort Hot… 0 14 2015-07-01 27 0 2 2 0
6 Resort Hot… 0 14 2015-07-01 27 0 2 2 0
7 Resort Hot… 0 0 2015-07-01 27 0 2 2 0
8 Resort Hot… 0 9 2015-07-01 27 0 2 2 0
9 Resort Hot… 1 85 2015-07-01 27 0 3 2 0
10 Resort Hot… 1 75 2015-07-01 27 0 3 2 0
# … with 119,380 more rows, 21 more variables: babies <dbl>, meal <fct>,
# country <fct>, market_segment <fct>, distribution_channel <fct>,
# is_repeated_guest <fct>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>, reserved_room_type <fct>,
# assigned_room_type <fct>, booking_changes <dbl>, deposit_type <fct>,
# agent <fct>, company <fct>, days_in_waiting_list <dbl>,
# customer_type <fct>, adr <dbl>, required_car_parking_spaces <dbl>, …
Time Dependent Visualization
# create a time series graph of adr vs. arrival_date
ggplot(hotel, aes(`arrival_date`, `adr`, color = `hotel`)) +
geom_line()
I chose a time series graph because it best portrays the change of adr
over time as arrival_date
increases. It appears that adr
is relatively stable over time with the exception of the adr between January 2016 and July 2016, which may be an outlier. adr
also doesn’t vary much depending on the type of hotel but there are a few peaks where the resort hotel has a higher adr than the city hotel.
Visualizing Part-Whole Relationships
# create a bar graph based on deposit_type
ggplot(hotel, aes(`deposit_type`, fill = `hotel`)) +
geom_bar()
# create hotel_new dataset with summarized counts of assigned_room_type by hotel
<- hotel %>%
hotel_new group_by(hotel, assigned_room_type) %>%
summarize(n = n())
# create a treemap based on hotel and assigned_room_type
treemap(hotel_new,
index = c("hotel", "assigned_room_type"),
vSize = "n",
type = "index")
I chose a bar graph based on deposit_type
because it best depicts the different deposit types and shows the count of each one based on the type of hotel. This graph clearly displays that a majority of bookings from both hotels do not require a deposit beforehand. If there is a deposit prior to booking, the city hotel has more non-refundable deposits than the resort hotel and the resort hotel has a small amount of refundable deposits.
I chose a treemap based on hotel
and assigned_room_type
because this helps visualize the counts of each assigned room type in a better way by illustrating different blocks of various sizes corresponding to the number of assigned room types. This treemap shows the resort hotel has more room types than the city hotel but the city hotel has the largest amount of “A” assigned room types. The city hotel square is larger than the resort hotel square, meaning there are more assigned rooms in the city hotel than the resort hotel.