Code
library(tidyverse)
library(lubridate)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Matthew Norberg
September 29, 2022
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
# A tibble: 119,390 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Data Frame Summary
df
Dimensions: 119390 x 32
Duplicates: 31994
-----------------------------------------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
---- -------------------------------- -------------------------- ---------------------- ---------------------- ---------- ---------
1 hotel 1. City Hotel 79330 (66.4%) IIIIIIIIIIIII 119390 0
[character] 2. Resort Hotel 40060 (33.6%) IIIIII (100.0%) (0.0%)
2 is_canceled Min : 0 0 : 75166 (63.0%) IIIIIIIIIIII 119390 0
[numeric] Mean : 0.4 1 : 44224 (37.0%) IIIIIII (100.0%) (0.0%)
Max : 1
3 lead_time Mean (sd) : 104 (106.9) 479 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 69 < 737 :
IQR (CV) : 142 (1) : : .
: : : . .
4 arrival_date_year Mean (sd) : 2016.2 (0.7) 2015 : 21996 (18.4%) III 119390 0
[numeric] min < med < max: 2016 : 56707 (47.5%) IIIIIIIII (100.0%) (0.0%)
2015 < 2016 < 2017 2017 : 40687 (34.1%) IIIIII
IQR (CV) : 1 (0)
5 arrival_date_month 1. August 13877 (11.6%) II 119390 0
[character] 2. July 12661 (10.6%) II (100.0%) (0.0%)
3. May 11791 ( 9.9%) I
4. October 11160 ( 9.3%) I
5. April 11089 ( 9.3%) I
6. June 10939 ( 9.2%) I
7. September 10508 ( 8.8%) I
8. March 9794 ( 8.2%) I
9. February 8068 ( 6.8%) I
10. November 6794 ( 5.7%) I
[ 2 others ] 12709 (10.6%) II
6 arrival_date_week_number Mean (sd) : 27.2 (13.6) 53 distinct values . : . . . 119390 0
[numeric] min < med < max: . : : : : : : (100.0%) (0.0%)
1 < 28 < 53 . : : : : : : : : :
IQR (CV) : 22 (0.5) : : : : : : : : : :
: : : : : : : : : :
7 arrival_date_day_of_month Mean (sd) : 15.8 (8.8) 31 distinct values : 119390 0
[numeric] min < med < max: : : : . : : . : : (100.0%) (0.0%)
1 < 16 < 31 : : : : : : : : : :
IQR (CV) : 15 (0.6) : : : : : : : : : :
: : : : : : : : : :
8 stays_in_weekend_nights Mean (sd) : 0.9 (1) 17 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 1 < 19 :
IQR (CV) : 2 (1.1) : :
: :
9 stays_in_week_nights Mean (sd) : 2.5 (1.9) 35 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 2 < 50 :
IQR (CV) : 2 (0.8) :
:
10 adults Mean (sd) : 1.9 (0.6) 14 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 2 < 55 :
IQR (CV) : 0 (0.3) :
:
11 children Mean (sd) : 0.1 (0.4) 0 : 110796 (92.8%) IIIIIIIIIIIIIIIIII 119386 4
[numeric] min < med < max: 1 : 4861 ( 4.1%) (100.0%) (0.0%)
0 < 0 < 10 2 : 3652 ( 3.1%)
IQR (CV) : 0 (3.8) 3 : 76 ( 0.1%)
10 : 1 ( 0.0%)
12 babies Mean (sd) : 0 (0.1) 0 : 118473 (99.2%) IIIIIIIIIIIIIIIIIII 119390 0
[numeric] min < med < max: 1 : 900 ( 0.8%) (100.0%) (0.0%)
0 < 0 < 10 2 : 15 ( 0.0%)
IQR (CV) : 0 (12.3) 9 : 1 ( 0.0%)
10 : 1 ( 0.0%)
13 meal 1. BB 92310 (77.3%) IIIIIIIIIIIIIII 119390 0
[character] 2. FB 798 ( 0.7%) (100.0%) (0.0%)
3. HB 14463 (12.1%) II
4. SC 10650 ( 8.9%) I
5. Undefined 1169 ( 1.0%)
14 country 1. PRT 48590 (40.7%) IIIIIIII 119390 0
[character] 2. GBR 12129 (10.2%) II (100.0%) (0.0%)
3. FRA 10415 ( 8.7%) I
4. ESP 8568 ( 7.2%) I
5. DEU 7287 ( 6.1%) I
6. ITA 3766 ( 3.2%)
7. IRL 3375 ( 2.8%)
8. BEL 2342 ( 2.0%)
9. BRA 2224 ( 1.9%)
10. NLD 2104 ( 1.8%)
[ 168 others ] 18590 (15.6%) III
15 market_segment 1. Aviation 237 ( 0.2%) 119390 0
[character] 2. Complementary 743 ( 0.6%) (100.0%) (0.0%)
3. Corporate 5295 ( 4.4%)
4. Direct 12606 (10.6%) II
5. Groups 19811 (16.6%) III
6. Offline TA/TO 24219 (20.3%) IIII
7. Online TA 56477 (47.3%) IIIIIIIII
8. Undefined 2 ( 0.0%)
16 distribution_channel 1. Corporate 6677 ( 5.6%) I 119390 0
[character] 2. Direct 14645 (12.3%) II (100.0%) (0.0%)
3. GDS 193 ( 0.2%)
4. TA/TO 97870 (82.0%) IIIIIIIIIIIIIIII
5. Undefined 5 ( 0.0%)
17 is_repeated_guest Min : 0 0 : 115580 (96.8%) IIIIIIIIIIIIIIIIIII 119390 0
[numeric] Mean : 0 1 : 3810 ( 3.2%) (100.0%) (0.0%)
Max : 1
18 previous_cancellations Mean (sd) : 0.1 (0.8) 15 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 0 < 26 :
IQR (CV) : 0 (9.7) :
:
19 previous_bookings_not_canceled Mean (sd) : 0.1 (1.5) 73 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 0 < 72 :
IQR (CV) : 0 (10.9) :
:
20 reserved_room_type 1. A 85994 (72.0%) IIIIIIIIIIIIII 119390 0
[character] 2. B 1118 ( 0.9%) (100.0%) (0.0%)
3. C 932 ( 0.8%)
4. D 19201 (16.1%) III
5. E 6535 ( 5.5%) I
6. F 2897 ( 2.4%)
7. G 2094 ( 1.8%)
8. H 601 ( 0.5%)
9. L 6 ( 0.0%)
10. P 12 ( 0.0%)
21 assigned_room_type 1. A 74053 (62.0%) IIIIIIIIIIII 119390 0
[character] 2. D 25322 (21.2%) IIII (100.0%) (0.0%)
3. E 7806 ( 6.5%) I
4. F 3751 ( 3.1%)
5. G 2553 ( 2.1%)
6. C 2375 ( 2.0%)
7. B 2163 ( 1.8%)
8. H 712 ( 0.6%)
9. I 363 ( 0.3%)
10. K 279 ( 0.2%)
[ 2 others ] 13 ( 0.0%)
22 booking_changes Mean (sd) : 0.2 (0.7) 21 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 0 < 21 :
IQR (CV) : 0 (2.9) :
:
23 deposit_type 1. No Deposit 104641 (87.6%) IIIIIIIIIIIIIIIII 119390 0
[character] 2. Non Refund 14587 (12.2%) II (100.0%) (0.0%)
3. Refundable 162 ( 0.1%)
24 agent 1. 9 31961 (26.8%) IIIII 119390 0
[character] 2. NULL 16340 (13.7%) II (100.0%) (0.0%)
3. 240 13922 (11.7%) II
4. 1 7191 ( 6.0%) I
5. 14 3640 ( 3.0%)
6. 7 3539 ( 3.0%)
7. 6 3290 ( 2.8%)
8. 250 2870 ( 2.4%)
9. 241 1721 ( 1.4%)
10. 28 1666 ( 1.4%)
[ 324 others ] 33250 (27.8%) IIIII
25 company 1. NULL 112593 (94.3%) IIIIIIIIIIIIIIIIII 119390 0
[character] 2. 40 927 ( 0.8%) (100.0%) (0.0%)
3. 223 784 ( 0.7%)
4. 67 267 ( 0.2%)
5. 45 250 ( 0.2%)
6. 153 215 ( 0.2%)
7. 174 149 ( 0.1%)
8. 219 141 ( 0.1%)
9. 281 138 ( 0.1%)
10. 154 133 ( 0.1%)
[ 343 others ] 3793 ( 3.2%)
26 days_in_waiting_list Mean (sd) : 2.3 (17.6) 128 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
0 < 0 < 391 :
IQR (CV) : 0 (7.6) :
:
27 customer_type 1. Contract 4076 ( 3.4%) 119390 0
[character] 2. Group 577 ( 0.5%) (100.0%) (0.0%)
3. Transient 89613 (75.1%) IIIIIIIIIIIIIII
4. Transient-Party 25124 (21.0%) IIII
28 adr Mean (sd) : 101.8 (50.5) 8879 distinct values : 119390 0
[numeric] min < med < max: : (100.0%) (0.0%)
-6.4 < 94.6 < 5400 :
IQR (CV) : 56.7 (0.5) :
:
29 required_car_parking_spaces Mean (sd) : 0.1 (0.2) 0 : 111974 (93.8%) IIIIIIIIIIIIIIIIII 119390 0
[numeric] min < med < max: 1 : 7383 ( 6.2%) I (100.0%) (0.0%)
0 < 0 < 8 2 : 28 ( 0.0%)
IQR (CV) : 0 (3.9) 3 : 3 ( 0.0%)
8 : 2 ( 0.0%)
30 total_of_special_requests Mean (sd) : 0.6 (0.8) 0 : 70318 (58.9%) IIIIIIIIIII 119390 0
[numeric] min < med < max: 1 : 33226 (27.8%) IIIII (100.0%) (0.0%)
0 < 0 < 5 2 : 12969 (10.9%) II
IQR (CV) : 1 (1.4) 3 : 2497 ( 2.1%)
4 : 340 ( 0.3%)
5 : 40 ( 0.0%)
31 reservation_status 1. Canceled 43017 (36.0%) IIIIIII 119390 0
[character] 2. Check-Out 75166 (63.0%) IIIIIIIIIIII (100.0%) (0.0%)
3. No-Show 1207 ( 1.0%)
32 reservation_status_date min : 2014-10-17 926 distinct values . : : : : 119390 0
[Date] med : 2016-08-07 : : : : : : . (100.0%) (0.0%)
max : 2017-09-14 . : : : : : : :
range : 2y 10m 28d : : : : : : : :
. : : : : : : : :
-----------------------------------------------------------------------------------------------------------------------------------
Each row in the data set describes a hotel booking for a hotel chain. We can see that from the ‘hotel’ column that each hotel in the chain is classified as either a ‘City Hotel’ or a ‘Resort Hotel’. Looking at the ‘arrival_date_year’ column, we can see that the data was collected between 2015 and 2017. Additionally, we can see from the ‘country’ column that the data was collected from all over the world. In total, we can see that there are 178 different values in the ‘country’ column. However, one of those values is “NULL”, so there are only 177 known countries included in the data set.
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
There are four rows in the data set that contain values that are not available. They are all in the ‘children’ column. I believe it is safe to remove these values since it is only 4 cases out of 119,390.
The data is tidy after removing the values which are not available. Every row represents an observation, each column represents one variable, and each value has its own cell.
Any additional comments?
Looking at the summary of the data frame, we can see that the ‘agent’ and ‘company’ columns are of type character. However, upon inspection, it appears as though they are all integers. We can inspect the two values in the two columns with the following R commands:
# A tibble: 334 × 1
agent
<chr>
1 NULL
2 304
3 240
4 303
5 15
6 241
7 8
8 250
9 115
10 5
# … with 324 more rows
# A tibble: 353 × 1
company
<chr>
1 NULL
2 110
3 113
4 270
5 178
6 240
7 154
8 144
9 307
10 268
# … with 343 more rows
By inspection, we can see that the columns are of character type because of string “NULL” in each column. This was likely placed in the data set to indicate that an agent or company was not used. However, it is a bit annoying that the columns are character types. For instance, this prevents you from sorting the columns by value. Here is an example:
# A tibble: 334 × 1
agent
<chr>
1 1
2 10
3 103
4 104
5 105
6 106
7 107
8 11
9 110
10 111
# … with 324 more rows
We can see in the tibble above, that 107 comes before 11. That isn’t quite right and it is only happening because the column is of type character. If we look at the column a bit further, we can see that none of the values are less than 0. These queries confirm that this is the case.
# A tibble: 0 × 1
# … with 1 variable: agent <chr>
# A tibble: 0 × 1
# … with 1 variable: company <chr>
To change the type of the column from character to double, I think we should replace all of the “NULL” strings with “-1”. Then we can convert the type of the column. We will just use the convention that a value of -1 in either column indicates that an agent or company was not used.
Now, the ‘agent’ and ‘company’ columns have been converted to double.
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
First, I think that the ‘arrival_date_week_number’ is not useful. We have the arrival month, year, and the day of the month. I am not sure why we need the ‘arrival_date_week_number’ as well. Also, certain dates do not always fall on the same date of the year. For example, this year, January 1st was on a Friday and last year it was on a Saturday. Consequently, the week of the year that someone arrives may not fall on the same date between years. Lastly, I am not sure how they determine what the first week of the year is. In the previous example, is the first week the span of Friday, Saturday, and Sunday because the first is on a Friday? I don’t know, but the first week could also be considered as the following week in the example because it is the first full week. Due to these details, I think it’s best to drop that column from the data set.
The next observation I had is that the ‘adr’ column stands for average daily rate. I think it would be good to rename this column to make the data more readable.
Looking at the columns, it appears as though many of the columns represent categorical data. These can all be converted to factors to make it easier to understand how many times each category appears in the data set.
# Create factors for categorical data
month_levels <- c(
"January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"
)
df$arrival_date_month <- factor(df$arrival_date_month, month_levels)
df$meal <- factor(df$meal, c("BB", "HB", "FB", "SC", "Undefined"))
df$hotel <- factor(df$hotel, c("City Hotel", "Resort Hotel"))
df$market_segment <- factor(df$market_segment, c("Aviation", "Complementary", "Corporate", "Direct", "Groups",
"Offline TA/TO", "Online TA"))
df$distribution_channel <- factor(df$distribution_channel, c("Corporate", "Direct", "GDS", "TA/TO", "Undefined"))
df$assigned_room_type <- factor(df$assigned_room_type, unique(df$assigned_room_type))
df$reserved_room_type <- factor(df$reserved_room_type, unique(df$reserved_room_type))
df$deposit_type <- factor(df$deposit_type, unique(df$deposit_type))
df$customer_type <- factor(df$customer_type, unique(df$customer_type))
df$reservation_status <- factor(df$reservation_status, unique(df$reservation_status))
The next observation is that we are recording the arrival date in three columns, instead of one. We can use mutate to combine the information in these columns. Afterwards, we can remove the old date columns to simplify the data set.
# Convert arrival information into single column and remove old columns
df <- df %>% mutate(date = make_date(arrival_date_year, arrival_date_month, arrival_date_day_of_month))
df <- df %>% relocate(date, .after = hotel)
df <- df %>% select(-c(arrival_date_year, arrival_date_month, arrival_date_day_of_month))
Next, we can see that the number of nights that the booking was scheduled for is recorded in two columns. One records the weeknights and the other records the weekdays. I think it would be a good idea to create a new column called total nights using mutate. This is done below.
Now that we have tidied up the tibble, we need to present it in a way that is readable. I think the best way to do that is to sort it by arrival date.
Any additional comments?
---
title: "Challenge 4 Instructions"
author: "Matthew Norberg"
desription: "More data wrangling: pivoting"
date: "09/29/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- abc_poll
- eggs
- fed_rates
- hotel_bookings
- debt
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- abc_poll.csv ⭐
- poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
- FedFundsRate.csv⭐⭐⭐
- hotel_bookings.csv⭐⭐⭐⭐
- debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
```{r}
# Read in data set and display values
df <- read_csv('./_data/hotel_bookings.csv')
df
# Print a summary of the data set to help describe it
summarytools::dfSummary(df)
```
### Briefly describe the data
Each row in the data set describes a hotel booking for a hotel chain. We can see that from the 'hotel' column that each hotel in the chain is classified as either a 'City Hotel' or a 'Resort Hotel'. Looking at the 'arrival_date_year' column, we can see that the data was collected between 2015 and 2017. Additionally, we can see from the 'country' column that the data was collected from all over the world. In total, we can see that there are 178 different values in the 'country' column. However, one of those values is "NULL", so there are only 177 known countries included in the data set.
## Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
There are four rows in the data set that contain values that are not available. They are all in the 'children' column. I believe it is safe to remove these values since it is only 4 cases out of 119,390.
```{r}
# Remove NA values
df <- na.omit(df)
```
The data is tidy after removing the values which are not available. Every row represents an observation, each column represents one variable, and each value has its own cell.
Any additional comments?
Looking at the summary of the data frame, we can see that the 'agent' and 'company' columns are of type character. However, upon inspection, it appears as though they are all integers. We can inspect the two values in the two columns with the following R commands:
```{r}
# List distinct values in agent and company
distinct(df, agent)
distinct(df, company)
```
By inspection, we can see that the columns are of character type because of string "NULL" in each column. This was likely placed in the data set to indicate that an agent or company was not used. However, it is a bit annoying that the columns are character types. For instance, this prevents you from sorting the columns by value. Here is an example:
```{r}
# Sort distinct values in agent column
arrange(distinct(df, agent), agent)
```
We can see in the tibble above, that 107 comes before 11. That isn't quite right and it is only happening because the column is of type character. If we look at the column a bit further, we can see that none of the values are less than 0. These queries confirm that this is the case.
```{r}
# Find agent and company values that have integer value less than 0
distinct(df, agent) %>% filter(agent != "NULL") %>% filter(as.integer(agent) <= 0)
distinct(df, company) %>% filter(company != "NULL") %>% filter(as.integer(company) <= 0)
```
To change the type of the column from character to double, I think we should replace all of the "NULL" strings with "-1". Then we can convert the type of the column. We will just use the convention that a value of -1 in either column indicates that an agent or company was not used.
```{r}
# Convert agent and company columns to double type
# Note: Replace "NULL" with "-1" first to do so
df <- df %>%
mutate(agent = str_replace(agent, "NULL", "-1")) %>%
mutate(company = str_replace(company, "NULL", "-1")) %>%
type_convert()
```
Now, the 'agent' and 'company' columns have been converted to double.
## Identify variables that need to be mutated
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
First, I think that the 'arrival_date_week_number' is not useful. We have the arrival month, year, and the day of the month. I am not sure why we need the 'arrival_date_week_number' as well. Also, certain dates do not always fall on the same date of the year. For example, this year, January 1st was on a Friday and last year it was on a Saturday. Consequently, the week of the year that someone arrives may not fall on the same date between years. Lastly, I am not sure how they determine what the first week of the year is. In the previous example, is the first week the span of Friday, Saturday, and Sunday because the first is on a Friday? I don't know, but the first week could also be considered as the following week in the example because it is the first full week. Due to these details, I think it's best to drop that column from the data set.
```{r}
# Remove arrival_date_week_number
df <- df %>% select(-(arrival_date_week_number))
```
The next observation I had is that the 'adr' column stands for average daily rate. I think it would be good to rename this column to make the data more readable.
```{r}
# Rename adr column
df <- df %>% rename(average_daily_rate = adr)
```
Looking at the columns, it appears as though many of the columns represent categorical data. These can all be converted to factors to make it easier to understand how many times each category appears in the data set.
```{r}
# Create factors for categorical data
month_levels <- c(
"January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"
)
df$arrival_date_month <- factor(df$arrival_date_month, month_levels)
df$meal <- factor(df$meal, c("BB", "HB", "FB", "SC", "Undefined"))
df$hotel <- factor(df$hotel, c("City Hotel", "Resort Hotel"))
df$market_segment <- factor(df$market_segment, c("Aviation", "Complementary", "Corporate", "Direct", "Groups",
"Offline TA/TO", "Online TA"))
df$distribution_channel <- factor(df$distribution_channel, c("Corporate", "Direct", "GDS", "TA/TO", "Undefined"))
df$assigned_room_type <- factor(df$assigned_room_type, unique(df$assigned_room_type))
df$reserved_room_type <- factor(df$reserved_room_type, unique(df$reserved_room_type))
df$deposit_type <- factor(df$deposit_type, unique(df$deposit_type))
df$customer_type <- factor(df$customer_type, unique(df$customer_type))
df$reservation_status <- factor(df$reservation_status, unique(df$reservation_status))
```
The next observation is that we are recording the arrival date in three columns, instead of one. We can use mutate to combine the information in these columns. Afterwards, we can remove the old date columns to simplify the data set.
```{r}
# Convert arrival information into single column and remove old columns
df <- df %>% mutate(date = make_date(arrival_date_year, arrival_date_month, arrival_date_day_of_month))
df <- df %>% relocate(date, .after = hotel)
df <- df %>% select(-c(arrival_date_year, arrival_date_month, arrival_date_day_of_month))
```
Next, we can see that the number of nights that the booking was scheduled for is recorded in two columns. One records the weeknights and the other records the weekdays. I think it would be a good idea to create a new column called total nights using mutate. This is done below.
```{r}
# Create a column to store total number of nights stayed
df <- df %>% mutate(total_nights = stays_in_weekend_nights + stays_in_week_nights) %>%
relocate(total_nights, .after = stays_in_week_nights)
```
Now that we have tidied up the tibble, we need to present it in a way that is readable. I think the best way to do that is to sort it by arrival date.
```{r}
# Sort by date
df <- df %>% arrange(date)
```
Any additional comments?