read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
identify variables that need to be mutated
mutate variables and sanity check all mutations
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
abc_poll.csv ⭐
poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
FedFundsRate.csv⭐⭐⭐
hotel_bookings.csv⭐⭐⭐⭐
debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
For this challenge I will be working with the “hotel_bookings” dataset. It is a publicly available data set containing booking transactions from a city hotel and a resort hotel in Portugal.
I used the “hotel_bookings” dataset previously for the Challenge 2 and I will be using the same content for the section 1: read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc).
Code
# Reading the hotel_bookings.csv data set and storing in a data framehotel_data <-read_csv("_data/hotel_bookings.csv")print(hotel_data)
After reading the data using read_csv function, it is stored in a dataframe “hotel_data”. The hotel_bookings data set contains 119390 rows (observations) and 32 columns (attributes/variables) including information such as when the booking was made, length of stay, the number of adults, children, and/or babies, the number of available parking spaces, and many others for the respective hotels. Each row/observation represents a hotel booking. Since this is a public data set, all data elements pertaining to hotel or customer identification are not included in the data set. This data set can have an important role for research and educational purposes in hotel administration, revenue management, machine learning, data mining and other fields. This data set will be helpful for (i) Tourists (people who are booking hotels) to check and understand trends of hotel price over a period of time and plan their travel accordingly within budget, learn about hotel accommodations and features before booking (ii) Hotel Management to keep track of the relevant information about themselves as well as their competitors. Understand and analyze the seasonal trend of hotel booking and accommodate different types of visitors that they have (iii) Tourist / hospitality services - (e.g. travel agency, airlines / car rental companies) to observe the times when hotels in the region are in high demand, analyze the duration of typical stays, and use the information to help plan their own services (iv) Local Government / independent data analysts to observe the overall trend of tourist activities in the region and analyzing the different types of visitors in the hotels during different seasons.
Using the “dfSummary” function from “summarytools” package we find that there are 31994 duplicates in the data. The reason for the identified duplication is that there is no unique id for each booking. It is possible that the booking was made by different tourists and the values for each attribute was exactly the same. This confusion could have been avoided by adding a Booking ID which would be unique for each booking. 66.4% of the data represents city hotel and the remaining 33.6% of the data represents resort hotel. 18.4%, 47.5% and 34.1% of the data correspondingly represents years 2015, 2016 and 2017. I wanted to check the unequal distribution in data for the 3 consecutive years. I further investigated the data and found that the data we have represents hotel bookings from the period July 2015 - August 2017.
Generated by summarytools 1.0.1 (R version 4.2.1) 2022-12-21
Code
#Check for the period of hotel booking dateshotel_data%>%select(hotel, arrival_date_year, arrival_date_month)%>%distinct()
# A tibble: 52 × 3
hotel arrival_date_year arrival_date_month
<chr> <dbl> <chr>
1 Resort Hotel 2015 July
2 Resort Hotel 2015 August
3 Resort Hotel 2015 September
4 Resort Hotel 2015 October
5 Resort Hotel 2015 November
6 Resort Hotel 2015 December
7 Resort Hotel 2016 January
8 Resort Hotel 2016 February
9 Resort Hotel 2016 March
10 Resort Hotel 2016 April
# … with 42 more rows
Code
#Check for missing/null data in the hotel_datasum(is.na(hotel_data))
[1] 4
Code
sum(is.null(hotel_data))
[1] 0
We find that there are 4 NA’s or missing values in the dataset.
Code
# Checking which columns have NA valuescol <-colnames(hotel_data)for (c in col){print(paste0("NA values in ", c, ": ", sum(is.na(hotel_data[,c]))))}
[1] "NA values in hotel: 0"
[1] "NA values in is_canceled: 0"
[1] "NA values in lead_time: 0"
[1] "NA values in arrival_date_year: 0"
[1] "NA values in arrival_date_month: 0"
[1] "NA values in arrival_date_week_number: 0"
[1] "NA values in arrival_date_day_of_month: 0"
[1] "NA values in stays_in_weekend_nights: 0"
[1] "NA values in stays_in_week_nights: 0"
[1] "NA values in adults: 0"
[1] "NA values in children: 4"
[1] "NA values in babies: 0"
[1] "NA values in meal: 0"
[1] "NA values in country: 0"
[1] "NA values in market_segment: 0"
[1] "NA values in distribution_channel: 0"
[1] "NA values in is_repeated_guest: 0"
[1] "NA values in previous_cancellations: 0"
[1] "NA values in previous_bookings_not_canceled: 0"
[1] "NA values in reserved_room_type: 0"
[1] "NA values in assigned_room_type: 0"
[1] "NA values in booking_changes: 0"
[1] "NA values in deposit_type: 0"
[1] "NA values in agent: 0"
[1] "NA values in company: 0"
[1] "NA values in days_in_waiting_list: 0"
[1] "NA values in customer_type: 0"
[1] "NA values in adr: 0"
[1] "NA values in required_car_parking_spaces: 0"
[1] "NA values in total_of_special_requests: 0"
[1] "NA values in reservation_status: 0"
[1] "NA values in reservation_status_date: 0"
We can see that all 4 NA’s in the hotel booking dataset are from the column “Children”. We can either replace the missing values with 0 or we can drop the 4 rows with NA values. I chose to drop the 4 rows as removing 4 rows from a huge dataset with 119390 observations/rows would not affect any of the statistics significantly.
Code
# Checking which columns have NULL valuescol <-colnames(hotel_data)for (c in col){print(paste0("NULL values in ", c, ": ", sum(is.null(hotel_data[,c]))))}
[1] "NULL values in hotel: 0"
[1] "NULL values in is_canceled: 0"
[1] "NULL values in lead_time: 0"
[1] "NULL values in arrival_date_year: 0"
[1] "NULL values in arrival_date_month: 0"
[1] "NULL values in arrival_date_week_number: 0"
[1] "NULL values in arrival_date_day_of_month: 0"
[1] "NULL values in stays_in_weekend_nights: 0"
[1] "NULL values in stays_in_week_nights: 0"
[1] "NULL values in adults: 0"
[1] "NULL values in children: 0"
[1] "NULL values in babies: 0"
[1] "NULL values in meal: 0"
[1] "NULL values in country: 0"
[1] "NULL values in market_segment: 0"
[1] "NULL values in distribution_channel: 0"
[1] "NULL values in is_repeated_guest: 0"
[1] "NULL values in previous_cancellations: 0"
[1] "NULL values in previous_bookings_not_canceled: 0"
[1] "NULL values in reserved_room_type: 0"
[1] "NULL values in assigned_room_type: 0"
[1] "NULL values in booking_changes: 0"
[1] "NULL values in deposit_type: 0"
[1] "NULL values in agent: 0"
[1] "NULL values in company: 0"
[1] "NULL values in days_in_waiting_list: 0"
[1] "NULL values in customer_type: 0"
[1] "NULL values in adr: 0"
[1] "NULL values in required_car_parking_spaces: 0"
[1] "NULL values in total_of_special_requests: 0"
[1] "NULL values in reservation_status: 0"
[1] "NULL values in reservation_status_date: 0"
Code
# Checking which columns have character datatype and have value == "NULL"hotel_data_subset <- hotel_data%>%select_if(is.character)col <-colnames(hotel_data_subset)for (c in col){print(paste0("NULL values in ", c, ": ", sum(hotel_data[,c]=="NULL")))}
[1] "NULL values in hotel: 0"
[1] "NULL values in arrival_date_month: 0"
[1] "NULL values in meal: 0"
[1] "NULL values in country: 488"
[1] "NULL values in market_segment: 0"
[1] "NULL values in distribution_channel: 0"
[1] "NULL values in reserved_room_type: 0"
[1] "NULL values in assigned_room_type: 0"
[1] "NULL values in deposit_type: 0"
[1] "NULL values in agent: 16340"
[1] "NULL values in company: 112593"
[1] "NULL values in customer_type: 0"
[1] "NULL values in reservation_status: 0"
Code
length(unique(hotel_data$country))
[1] 178
Code
table(hotel_data$country)
ABW AGO AIA ALB AND ARE ARG ARM ASM ATA ATF AUS AUT
2 362 1 12 7 51 214 8 1 2 1 426 1263
AZE BDI BEL BEN BFA BGD BGR BHR BHS BIH BLR BOL BRA
17 1 2342 3 1 12 75 5 1 13 26 10 2224
BRB BWA CAF CHE CHL CHN CIV CMR CN COL COM CPV CRI
4 1 5 1730 65 999 6 10 1279 71 2 24 19
CUB CYM CYP CZE DEU DJI DMA DNK DOM DZA ECU EGY ESP
8 1 51 171 7287 1 1 435 14 103 27 32 8568
EST ETH FIN FJI FRA FRO GAB GBR GEO GGY GHA GIB GLP
83 3 447 1 10415 5 4 12129 22 3 4 18 2
GNB GRC GTM GUY HKG HND HRV HUN IDN IMN IND IRL IRN
9 128 4 1 29 1 100 230 35 2 152 3375 83
IRQ ISL ISR ITA JAM JEY JOR JPN KAZ KEN KHM KIR KNA
14 57 669 3766 6 8 21 197 19 6 2 1 2
KOR KWT LAO LBN LBY LCA LIE LKA LTU LUX LVA MAC MAR
133 16 2 31 8 1 3 7 81 287 55 16 259
MCO MDG MDV MEX MKD MLI MLT MMR MNE MOZ MRT MUS MWI
4 1 12 85 10 1 18 1 5 67 1 7 2
MYS MYT NAM NCL NGA NIC NLD NOR NPL NULL NZL OMN PAK
28 2 1 1 34 1 2104 607 1 488 74 18 14
PAN PER PHL PLW POL PRI PRT PRY PYF QAT ROU RUS RWA
9 29 40 1 919 12 48590 4 1 15 500 632 2
SAU SDN SEN SGP SLE SLV SMR SRB STP SUR SVK SVN SWE
48 1 11 39 1 2 1 101 2 5 65 57 1024
SYC SYR TGO THA TJK TMP TUN TUR TWN TZA UGA UKR UMI
2 3 2 59 9 3 39 248 51 5 2 68 1
URY USA UZB VEN VGB VNM ZAF ZMB ZWE
32 2097 4 26 1 8 80 2 4
We can see that there are bookings from people belonging to 178 distinct countries. However, from the output of table() we can see that one country is given “NULL” as the value is unknown. Hence, we can say that there are 177 distinct countries in the hotel_bookings dataset. In future, we may have to drop the rows with “NULL” country if we plan to plot geospatial visualizations.
The datatype is character for both the columns “agent” and “company” due to which the numbers are not sorted/arranged as expected. From the table() we notice that both the columns have all numerical values except for the “NULL” value which is used for the bookings which did not use an agent or a company for booking. If we change these “NULL” string values to a numerical value like -1 (as no negative values are being used in these columns), then we can change the column type to numeric.
Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
As the first step of tidying the data, I dropped the rows with NA values in “children” column (4 rows to be exact).
Code
# Dropping the rows with NA values in "children" columnhotel_data <- hotel_data%>%subset(!is.na(children))hotel_data
Next, I replaced the “NULL” values with “-1” values in “agent” and “company” columns.
Code
# Replace the "NULL" values with "-1" in "agent" and "company" columnshotel_data <- hotel_data%>%mutate(agent =str_replace(agent, "NULL", "-1"))%>%mutate(company =str_replace(company, "NULL", "-1"))
I also checked that all values are numerical in the “agent” and “company” columns (i.e no “NULL” values).
Code
# Sanity check: Checking that all values are numerical in the "agent" and "company" columns (i.e no "NULL" values)table(hotel_data$agent)
Finally, I converted the datatype of “agent” and “company” columns from character to numeric.
Code
# Converting the datatype of "agent" and "company" columns from character to numerichotel_data <- hotel_data%>%mutate(agent =as.numeric(agent))%>%mutate(company =as.numeric(company))
I verified that the the new datatype of “agent” and “company” is numeric.
Code
# Sanity check: Verify the new datatype of "agent" and "company" is numericstr(hotel_data)
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
Knowing the demand i.e total guests staying in the booked hotel during a time frame would help in visualizing trends in the form of line plots. These trends would be helpful for tourists to identify the best time to visit Portugal and book the rooms earlier for a lesser price or for hotel management/travel agents to inflate the prices of the rooms according to the demand. We can calculate demand as the sum of adults, children and babies.
Code
# Calculate demand as the sum of adults, children and babieshotel_data <- hotel_data%>%mutate(demand = adults+children+babies)hotel_data
Interesting fact! After creating the “demand” attribute, I performed the table() and found that there are 180 bookings with the demand listed as 0. On checking the reservation status for these 180 rows, 155 bookings show that the reservation status is “Check-Out”. According to the dataset, “Check-Out” is defined as – customer has checked in but already departed. It is surprising that the customer checked in and out but the demand is 0! Would like to know more about the reason behind this.
Code
# Combine the columns "arrival_date_year", "arrival_date_month", "arrival_date_day_of_month" to get the arrival date in a single column. library(lubridate)hotel_data <- hotel_data%>%mutate(arrival_date =ymd(paste(hotel_data$arrival_date_year, hotel_data$arrival_date_month, hotel_data$arrival_date_day_of_month, sep="/")))#Removing the columns related to date in the dataset except for the "arrival_date" mutated columnhotel_data <- hotel_data%>%select(-c(arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month))
I created a new column “arrival_date” which combined the data from “arrival_date_year”, “arrival_date_month”, and “arrival_date_day_of_month” to get the arrival date in a single column and removed the columns “arrival_date_year”, “arrival_date_month”, “arrival_date_day_of_month”, and “arrival_date_week_number” as they are redundant data. The mutated “arrival_date” column will be useful to plot and analyze trends.
Code
# Find the min and max arrival_datemin(hotel_data$arrival_date)
[1] "2015-07-01"
Code
max(hotel_data$arrival_date)
[1] "2017-08-31"
From the mutated variable “arrival_date” we can easily understand that the “hotel_bookings” dataset has data for the arrival period of “2015-07-01” to “2017-08-31”.
Currently, the data contains information about the “lead_time” (Number of days that elapsed between the date of hotel booking and the arrival date) and the “arrival_date” at the hotel. It would be useful to create visualizations between the “arrival_date”, “booking_date” and “adr” for insights. For this purpose, it would be suitable if the “booking_date” was calculated from “arrival_date” and the “lead_time”. This will help customers to understand the right time to book hotels and the demand.
Code
# Calculating "booking_date" variable from "arrival_date" and "lead_time"hotel_data <- hotel_data%>%mutate(booking_date = arrival_date - lead_time)hotel_data
# Summary of booking_datesummary(hotel_data$booking_date)
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2013-06-24" "2015-11-28" "2016-05-04" "2016-05-16" "2016-12-09" "2017-08-31"
We can see that the earliest hotel booking for the period of arrival from “2015-07-01” to “2017-08-31” was done on the date “2013-06-24”. This is a lead_time of 737 days!
Code
# Sort the dataset based on arrival_date.hotel_data <- hotel_data%>%arrange(arrival_date)
The final dataset is sorted based on “arrival_date” in ascending order.
Any additional comments?
After tidying the data and mutating variables, we are left with a dataset of 119386 rows/observations and 32 columns/variables. We can use this dataset to perform visualizations and generate insights.
Source Code
---title: "Challenge 4 Solutions"author: "Vinitha Maheswaran"desription: "More data wrangling: pivoting"date: "11/26/2022"format: html: toc: true code-fold: true code-copy: true code-tools: truecategories: - challenge_4 - abc_poll - eggs - fed_rates - hotel_bookings - debt---```{r}#| label: setup#| warning: false#| message: falselibrary(tidyverse)knitr::opts_chunk$set(echo =TRUE, warning=FALSE, message=FALSE)```## Challenge OverviewToday's challenge is to:1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)2) tidy data (as needed, including sanity checks)3) identify variables that need to be mutated4) mutate variables and sanity check all mutations## Read in dataRead in one (or more) of the following datasets, using the correct R package and command.- abc_poll.csv ⭐- poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐- FedFundsRate.csv⭐⭐⭐- hotel_bookings.csv⭐⭐⭐⭐- debt_in_trillions.xlsx ⭐⭐⭐⭐⭐For this challenge I will be working with the "hotel_bookings" dataset. It is a publicly available data set containing booking transactions from a city hotel and a resort hotel in Portugal.I used the "hotel_bookings" dataset previously for the Challenge 2 and I will be using the same content for the section 1: read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc).```{r}# Reading the hotel_bookings.csv data set and storing in a data framehotel_data <-read_csv("_data/hotel_bookings.csv")print(hotel_data)```### Briefly describe the dataAfter reading the data using read_csv function, it is stored in a dataframe "hotel_data". The hotel_bookings data set contains 119390 rows (observations) and 32 columns (attributes/variables) including information such as when the booking was made, length of stay, the number of adults, children, and/or babies, the number of available parking spaces, and many others for the respective hotels. Each row/observation represents a hotel booking. Since this is a public data set, all data elements pertaining to hotel or customer identification are not included in the data set. This data set can have an important role for research and educational purposes in hotel administration, revenue management, machine learning, data mining and other fields. This data set will be helpful for (i) Tourists (people who are booking hotels) to check and understand trends of hotel price over a period of time and plan their travel accordingly within budget, learn about hotel accommodations and features before booking (ii) Hotel Management to keep track of the relevant information about themselves as well as their competitors. Understand and analyze the seasonal trend of hotel booking and accommodate different types of visitors that they have (iii) Tourist / hospitality services - (e.g. travel agency, airlines / car rental companies) to observe the times when hotels in the region are in high demand, analyze the duration of typical stays, and use the information to help plan their own services (iv) Local Government / independent data analysts to observe the overall trend of tourist activities in the region and analyzing the different types of visitors in the hotels during different seasons.Using the "dfSummary" function from "summarytools" package we find that there are 31994 duplicates in the data. The reason for the identified duplication is that there is no unique id for each booking. It is possible that the booking was made by different tourists and the values for each attribute was exactly the same. This confusion could have been avoided by adding a Booking ID which would be unique for each booking. 66.4% of the data represents city hotel and the remaining 33.6% of the data represents resort hotel. 18.4%, 47.5% and 34.1% of the data correspondingly represents years 2015, 2016 and 2017. I wanted to check the unequal distribution in data for the 3 consecutive years. I further investigated the data and found that the data we have represents hotel bookings from the period July 2015 - August 2017.```{r}#Finding dimension of the data setdim(hotel_data)``````{r}#Finding column namescolnames(hotel_data)``````{r}#Structure of hotel_datastr(hotel_data)``````{r}#Summary of hotel_datasummary(hotel_data)``````{r}#Summary of hotel_datalibrary(summarytools)print(summarytools::dfSummary(hotel_data,varnumbers =FALSE,plain.ascii =FALSE, style ="grid", graph.magnif =0.60, valid.col =FALSE),method ='render',table.classes ='table-condensed')``````{r}#Check for the period of hotel booking dateshotel_data%>%select(hotel, arrival_date_year, arrival_date_month)%>%distinct()``````{r}#Check for missing/null data in the hotel_datasum(is.na(hotel_data))sum(is.null(hotel_data))```We find that there are 4 NA's or missing values in the dataset.```{r}# Checking which columns have NA valuescol <-colnames(hotel_data)for (c in col){print(paste0("NA values in ", c, ": ", sum(is.na(hotel_data[,c]))))}```We can see that all 4 NA's in the hotel booking dataset are from the column "Children". We can either replace the missing values with 0 or we can drop the 4 rows with NA values. I chose to drop the 4 rows as removing 4 rows from a huge dataset with 119390 observations/rows would not affect any of the statistics significantly.```{r}# Checking which columns have NULL valuescol <-colnames(hotel_data)for (c in col){print(paste0("NULL values in ", c, ": ", sum(is.null(hotel_data[,c]))))}``````{r}# Checking which columns have character datatype and have value == "NULL"hotel_data_subset <- hotel_data%>%select_if(is.character)col <-colnames(hotel_data_subset)for (c in col){print(paste0("NULL values in ", c, ": ", sum(hotel_data[,c]=="NULL")))}``````{r}length(unique(hotel_data$country))table(hotel_data$country)```We can see that there are bookings from people belonging to 178 distinct countries. However, from the output of table() we can see that one country is given "NULL" as the value is unknown. Hence, we can say that there are 177 distinct countries in the hotel_bookings dataset. In future, we may have to drop the rows with "NULL" country if we plan to plot geospatial visualizations. ```{r}table(hotel_data$agent)``````{r}table(hotel_data$company)```The datatype is character for both the columns "agent" and "company" due to which the numbers are not sorted/arranged as expected. From the table() we notice that both the columns have all numerical values except for the "NULL" value which is used for the bookings which did not use an agent or a company for booking. If we change these "NULL" string values to a numerical value like -1 (as no negative values are being used in these columns), then we can change the column type to numeric.## Tidy Data (as needed)Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.As the first step of tidying the data, I dropped the rows with NA values in "children" column (4 rows to be exact).```{r}# Dropping the rows with NA values in "children" columnhotel_data <- hotel_data%>%subset(!is.na(children))hotel_data```Next, I replaced the "NULL" values with "-1" values in "agent" and "company" columns.```{r}# Replace the "NULL" values with "-1" in "agent" and "company" columnshotel_data <- hotel_data%>%mutate(agent =str_replace(agent, "NULL", "-1"))%>%mutate(company =str_replace(company, "NULL", "-1"))```I also checked that all values are numerical in the "agent" and "company" columns (i.e no "NULL" values).```{r}# Sanity check: Checking that all values are numerical in the "agent" and "company" columns (i.e no "NULL" values)table(hotel_data$agent)table(hotel_data$company)```Finally, I converted the datatype of "agent" and "company" columns from character to numeric.```{r}# Converting the datatype of "agent" and "company" columns from character to numerichotel_data <- hotel_data%>%mutate(agent =as.numeric(agent))%>%mutate(company =as.numeric(company))```I verified that the the new datatype of "agent" and "company" is numeric.```{r}# Sanity check: Verify the new datatype of "agent" and "company" is numericstr(hotel_data)```## Identify variables that need to be mutatedAre there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?Document your work here.Knowing the demand i.e total guests staying in the booked hotel during a time frame would help in visualizing trends in the form of line plots. These trends would be helpful for tourists to identify the best time to visit Portugal and book the rooms earlier for a lesser price or for hotel management/travel agents to inflate the prices of the rooms according to the demand. We can calculate demand as the sum of adults, children and babies.```{r}# Calculate demand as the sum of adults, children and babieshotel_data <- hotel_data%>%mutate(demand = adults+children+babies)hotel_data``````{r}# Interesting fact about demandtable(hotel_data$demand)hotel_data_demand0 <- hotel_data%>%subset(demand==0)table(hotel_data_demand0$reservation_status)```Interesting fact! After creating the "demand" attribute, I performed the table() and found that there are 180 bookings with the demand listed as 0. On checking the reservation status for these 180 rows, 155 bookings show that the reservation status is "Check-Out". According to the dataset, "Check-Out" is defined as – customer has checked in but already departed. It is surprising that the customer checked in and out but the demand is 0! Would like to know more about the reason behind this.```{r}# Combine the columns "arrival_date_year", "arrival_date_month", "arrival_date_day_of_month" to get the arrival date in a single column. library(lubridate)hotel_data <- hotel_data%>%mutate(arrival_date =ymd(paste(hotel_data$arrival_date_year, hotel_data$arrival_date_month, hotel_data$arrival_date_day_of_month, sep="/")))#Removing the columns related to date in the dataset except for the "arrival_date" mutated columnhotel_data <- hotel_data%>%select(-c(arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month))```I created a new column "arrival_date" which combined the data from "arrival_date_year", "arrival_date_month", and "arrival_date_day_of_month" to get the arrival date in a single column and removed the columns "arrival_date_year", "arrival_date_month", "arrival_date_day_of_month", and "arrival_date_week_number" as they are redundant data. The mutated "arrival_date" column will be useful to plot and analyze trends.```{r}# Find the min and max arrival_datemin(hotel_data$arrival_date)max(hotel_data$arrival_date)```From the mutated variable "arrival_date" we can easily understand that the "hotel_bookings" dataset has data for the arrival period of "2015-07-01" to "2017-08-31".Currently, the data contains information about the "lead_time" (Number of days that elapsed between the date of hotel booking and the arrival date) and the "arrival_date" at the hotel. It would be useful to create visualizations between the "arrival_date", "booking_date" and "adr" for insights. For this purpose, it would be suitable if the "booking_date" was calculated from "arrival_date" and the "lead_time". This will help customers to understand the right time to book hotels and the demand.```{r}# Calculating "booking_date" variable from "arrival_date" and "lead_time"hotel_data <- hotel_data%>%mutate(booking_date = arrival_date - lead_time)hotel_data``````{r}# Summary of booking_datesummary(hotel_data$booking_date)```We can see that the earliest hotel booking for the period of arrival from "2015-07-01" to "2017-08-31" was done on the date "2013-06-24". This is a lead_time of 737 days!```{r}# Sort the dataset based on arrival_date.hotel_data <- hotel_data%>%arrange(arrival_date)```The final dataset is sorted based on "arrival_date" in ascending order.Any additional comments?After tidying the data and mutating variables, we are left with a dataset of 119386 rows/observations and 32 columns/variables. We can use this dataset to perform visualizations and generate insights.