Code
library(tidyverse)
library(lubridate)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Shriya
November 27, 2022
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
# A tibble: 119,390 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
The above dataset consists data from primarily 2 different hotels ‘City Hotel’ and ‘Resort Hotel’ and has 119390 rows and 31 columns. Each row contains the information about the customer booking like customer type reservation status, number of adults/children/babies in the room, arrival date, stay days.
The data can be tidied up in the following way
ABW AGO AIA ALB AND ARE ARG ARM ASM ATA ATF AUS AUT
2 362 1 12 7 51 214 8 1 2 1 426 1263
AZE BDI BEL BEN BFA BGD BGR BHR BHS BIH BLR BOL BRA
17 1 2342 3 1 12 75 5 1 13 26 10 2224
BRB BWA CAF CHE CHL CHN CIV CMR CN COL COM CPV CRI
4 1 5 1730 65 999 6 10 1279 71 2 24 19
CUB CYM CYP CZE DEU DJI DMA DNK DOM DZA ECU EGY ESP
8 1 51 171 7287 1 1 435 14 103 27 32 8568
EST ETH FIN FJI FRA FRO GAB GBR GEO GGY GHA GIB GLP
83 3 447 1 10415 5 4 12129 22 3 4 18 2
GNB GRC GTM GUY HKG HND HRV HUN IDN IMN IND IRL IRN
9 128 4 1 29 1 100 230 35 2 152 3375 83
IRQ ISL ISR ITA JAM JEY JOR JPN KAZ KEN KHM KIR KNA
14 57 669 3766 6 8 21 197 19 6 2 1 2
KOR KWT LAO LBN LBY LCA LIE LKA LTU LUX LVA MAC MAR
133 16 2 31 8 1 3 7 81 287 55 16 259
MCO MDG MDV MEX MKD MLI MLT MMR MNE MOZ MRT MUS MWI
4 1 12 85 10 1 18 1 5 67 1 7 2
MYS MYT NAM NCL NGA NIC NLD NOR NPL NULL NZL OMN PAK
28 2 1 1 34 1 2104 607 1 488 74 18 14
PAN PER PHL PLW POL PRI PRT PRY PYF QAT ROU RUS RWA
9 29 40 1 919 12 48590 4 1 15 500 632 2
SAU SDN SEN SGP SLE SLV SMR SRB STP SUR SVK SVN SWE
48 1 11 39 1 2 1 101 2 5 65 57 1024
SYC SYR TGO THA TJK TMP TUN TUR TWN TZA UGA UKR UMI
2 3 2 59 9 3 39 248 51 5 2 68 1
URY USA UZB VEN VGB VNM ZAF ZMB ZWE
32 2097 4 26 1 8 80 2 4
As there is NULL value in the country column, it can be removed.
hotel is_canceled
"character" "numeric"
lead_time arrival_date_year
"numeric" "numeric"
arrival_date_month arrival_date_week_number
"character" "numeric"
arrival_date_day_of_month stays_in_weekend_nights
"numeric" "numeric"
stays_in_week_nights adults
"numeric" "numeric"
children babies
"numeric" "numeric"
meal country
"character" "character"
market_segment distribution_channel
"character" "character"
is_repeated_guest previous_cancellations
"numeric" "numeric"
previous_bookings_not_canceled reserved_room_type
"numeric" "character"
assigned_room_type booking_changes
"character" "numeric"
deposit_type agent
"character" "character"
company days_in_waiting_list
"character" "numeric"
customer_type adr
"character" "numeric"
required_car_parking_spaces total_of_special_requests
"numeric" "numeric"
reservation_status reservation_status_date
"character" "Date"
Agents and Company have numerical values but their classes are described as character. Hence, the NULL can be modified to NA and the column can be modified to the numerical value. Arrival date in year, month and day can be combined into a single field arrival date.
Some variables that can be mutated.
#Mutating arrival date into a single field
#finding the total number guests.
dataset_mutate <- dataset %>%
mutate(arrival_date = str_c(arrival_date_day_of_month,
arrival_date_month,
arrival_date_year, sep="/"),
arrival_date = dmy(arrival_date),
total_guests = adults + children + babies) %>%
select(-c(arrival_date_day_of_month,arrival_date_month,arrival_date_year))
dataset_mutate
# A tibble: 118,902 × 31
hotel is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Resort H… 0 342 27 0 0 2 0 0 BB
2 Resort H… 0 737 27 0 0 2 0 0 BB
3 Resort H… 0 7 27 0 1 1 0 0 BB
4 Resort H… 0 13 27 0 1 1 0 0 BB
5 Resort H… 0 14 27 0 2 2 0 0 BB
6 Resort H… 0 14 27 0 2 2 0 0 BB
7 Resort H… 0 0 27 0 2 2 0 0 BB
8 Resort H… 0 9 27 0 2 2 0 0 FB
9 Resort H… 1 85 27 0 3 2 0 0 BB
10 Resort H… 1 75 27 0 3 2 0 0 HB
# … with 118,892 more rows, 21 more variables: country <chr>,
# market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>, reservation_status <chr>, …
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2015-07-01" "2016-03-14" "2016-09-07" "2016-08-29" "2017-03-19" "2017-08-31"
We can see that the arrival date mostly lie between the July 2015- August 2017.
As described above, the type of the agent and company can be changed from character to numeric.
Is agent field numeric?
Is company field numeric?
---
title: "Challenge 4"
author: "Shriya"
desription: "More data wrangling: pivoting"
date: "11/27/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- abc_poll
- eggs
- fed_rates
- hotel_bookings
- debt
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- abc_poll.csv ⭐
- poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
- FedFundsRate.csv⭐⭐⭐
- hotel_bookings.csv⭐⭐⭐⭐
- debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
```{r}
dataset <- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)
dataset
```
### Briefly describe the data
The above dataset consists data from primarily 2 different hotels 'City Hotel' and 'Resort Hotel' and has 119390 rows and 31 columns. Each row contains the information about the customer booking like customer type reservation status, number of adults/children/babies in the room, arrival date, stay days.
## Tidy Data (as needed)
The data can be tidied up in the following way
```{r}
table(dataset$country)
```
As there is NULL value in the country column, it can be removed.
```{r}
dataset <- dataset %>%
filter(!(country == "NULL"))
```
```{r}
#Class of each column in the dataset.
sapply(dataset, class)
```
Agents and Company have numerical values but their classes are described as character. Hence, the NULL can be modified to NA and the column can be modified to the numerical value.
Arrival date in year, month and day can be combined into a single field arrival date.
## Identify variables that need to be mutated
Some variables that can be mutated.
```{r}
#Mutating arrival date into a single field
#finding the total number guests.
dataset_mutate <- dataset %>%
mutate(arrival_date = str_c(arrival_date_day_of_month,
arrival_date_month,
arrival_date_year, sep="/"),
arrival_date = dmy(arrival_date),
total_guests = adults + children + babies) %>%
select(-c(arrival_date_day_of_month,arrival_date_month,arrival_date_year))
dataset_mutate
```
```{r}
summary(dataset_mutate$arrival_date)
```
We can see that the arrival date mostly lie between the July 2015- August 2017.
As described above, the type of the agent and company can be changed from character to numeric.
```{r}
#Mutating the class of the agent and company field from character to numeric
dataset_mutate <- dataset_mutate %>%
mutate(across(c(agent, company),~ replace(.,str_detect(., "NULL"), NA))) %>% mutate_at(vars(agent, company),as.numeric)
```
Is agent field numeric?
```{r}
is.numeric(dataset_mutate$agent)
```
Is company field numeric?
```{r}
is.numeric(dataset_mutate$company)
```