Code
library(tidyverse)
library(lubridate)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Meredith Rolfe
August 18, 2022
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
# A tibble: 119,390 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
Let’s use lubridate package, so we can check unique values in all the columns.
from the given result- 119,390 × 32, we can say dataset has 119390 rows and 31 columns so, each row in the data set belongs to the hotel bookings done by each customer and it describe- type, reservation status, arrival date, no. of the adult or children in the room, room type, customer, stay days, meal, previous cancellations.
hotel is_canceled
2 2
lead_time arrival_date_year
479 3
arrival_date_month arrival_date_week_number
12 53
arrival_date_day_of_month stays_in_weekend_nights
31 17
stays_in_week_nights adults
35 14
children babies
6 5
meal country
5 178
market_segment distribution_channel
8 5
is_repeated_guest previous_cancellations
2 15
previous_bookings_not_canceled reserved_room_type
73 10
assigned_room_type booking_changes
12 21
deposit_type agent
3 334
company days_in_waiting_list
353 128
customer_type adr
4 8879
required_car_parking_spaces total_of_special_requests
5 6
reservation_status reservation_status_date
3 926
Any additional comments?
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
ABW AGO AIA ALB AND ARE ARG ARM ASM ATA ATF AUS AUT
2 362 1 12 7 51 214 8 1 2 1 426 1263
AZE BDI BEL BEN BFA BGD BGR BHR BHS BIH BLR BOL BRA
17 1 2342 3 1 12 75 5 1 13 26 10 2224
BRB BWA CAF CHE CHL CHN CIV CMR CN COL COM CPV CRI
4 1 5 1730 65 999 6 10 1279 71 2 24 19
CUB CYM CYP CZE DEU DJI DMA DNK DOM DZA ECU EGY ESP
8 1 51 171 7287 1 1 435 14 103 27 32 8568
EST ETH FIN FJI FRA FRO GAB GBR GEO GGY GHA GIB GLP
83 3 447 1 10415 5 4 12129 22 3 4 18 2
GNB GRC GTM GUY HKG HND HRV HUN IDN IMN IND IRL IRN
9 128 4 1 29 1 100 230 35 2 152 3375 83
IRQ ISL ISR ITA JAM JEY JOR JPN KAZ KEN KHM KIR KNA
14 57 669 3766 6 8 21 197 19 6 2 1 2
KOR KWT LAO LBN LBY LCA LIE LKA LTU LUX LVA MAC MAR
133 16 2 31 8 1 3 7 81 287 55 16 259
MCO MDG MDV MEX MKD MLI MLT MMR MNE MOZ MRT MUS MWI
4 1 12 85 10 1 18 1 5 67 1 7 2
MYS MYT NAM NCL NGA NIC NLD NOR NPL NULL NZL OMN PAK
28 2 1 1 34 1 2104 607 1 488 74 18 14
PAN PER PHL PLW POL PRI PRT PRY PYF QAT ROU RUS RWA
9 29 40 1 919 12 48590 4 1 15 500 632 2
SAU SDN SEN SGP SLE SLV SMR SRB STP SUR SVK SVN SWE
48 1 11 39 1 2 1 101 2 5 65 57 1024
SYC SYR TGO THA TJK TMP TUN TUR TWN TZA UGA UKR UMI
2 3 2 59 9 3 39 248 51 5 2 68 1
URY USA UZB VEN VGB VNM ZAF ZMB ZWE
32 2097 4 26 1 8 80 2 4
We can see lot of NULL value in the country column, so we should remove this because it won’t help us.
To see the class for each columns let’s do,
hotel is_canceled
"character" "numeric"
lead_time arrival_date_year
"numeric" "numeric"
arrival_date_month arrival_date_week_number
"character" "numeric"
arrival_date_day_of_month stays_in_weekend_nights
"numeric" "numeric"
stays_in_week_nights adults
"numeric" "numeric"
children babies
"numeric" "numeric"
meal country
"character" "character"
market_segment distribution_channel
"character" "character"
is_repeated_guest previous_cancellations
"numeric" "numeric"
previous_bookings_not_canceled reserved_room_type
"numeric" "character"
assigned_room_type booking_changes
"character" "numeric"
deposit_type agent
"character" "character"
company days_in_waiting_list
"character" "numeric"
customer_type adr
"character" "numeric"
required_car_parking_spaces total_of_special_requests
"numeric" "numeric"
reservation_status reservation_status_date
"character" "Date"
Now we can mutate the arrival date into one single field and after that we can also mutate the adults and children so we can get the total guests in the hotel and then we will see the range of the arrival date of the bookings.
# A tibble: 118,902 × 31
hotel is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Resort H… 0 342 27 0 0 2 0 0 BB
2 Resort H… 0 737 27 0 0 2 0 0 BB
3 Resort H… 0 7 27 0 1 1 0 0 BB
4 Resort H… 0 13 27 0 1 1 0 0 BB
5 Resort H… 0 14 27 0 2 2 0 0 BB
6 Resort H… 0 14 27 0 2 2 0 0 BB
7 Resort H… 0 0 27 0 2 2 0 0 BB
8 Resort H… 0 9 27 0 2 2 0 0 FB
9 Resort H… 1 85 27 0 3 2 0 0 BB
10 Resort H… 1 75 27 0 3 2 0 0 HB
# … with 118,892 more rows, 21 more variables: country <chr>,
# market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>, reservation_status <chr>, …
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2015-07-01" "2016-03-14" "2016-09-07" "2016-08-29" "2017-03-19" "2017-08-31"
From the above information we can see that arrival dates now lie between 2015-07-01 and 2017-08-31.
# A tibble: 118,902 × 32
hotel is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Resort H… 0 342 27 0 0 2 0 0 BB
2 Resort H… 0 737 27 0 0 2 0 0 BB
3 Resort H… 0 7 27 0 1 1 0 0 BB
4 Resort H… 0 13 27 0 1 1 0 0 BB
5 Resort H… 0 14 27 0 2 2 0 0 BB
6 Resort H… 0 14 27 0 2 2 0 0 BB
7 Resort H… 0 0 27 0 2 2 0 0 BB
8 Resort H… 0 9 27 0 2 2 0 0 FB
9 Resort H… 1 85 27 0 3 2 0 0 BB
10 Resort H… 1 75 27 0 3 2 0 0 HB
# … with 118,892 more rows, 22 more variables: country <chr>,
# market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>, reservation_status <chr>, …
Now it adds a new column to the hotel_mutate
data frame called booking_date
, which is calculated by subtracting the lead_time
column from the arrival_date
column. This new column represents the date on which the hotel reservation was booked. After adding the new booking_date
column, it returns the updated hotel_mutate
data frame.
Error in unique(hotels_booking$reservation_status): object 'hotels_booking' not found
Now we will find that when the status was changed to cancelled after booking and also after that we will Mutate the class of the agent and company field to numeric.
hotel_cancellations <- filter(hotel_mutate, reservation_status == 'Canceled')
hotel_cancellations <- mutate(hotel_cancellations, canc_time = booking_date - reservation_status_date)
summarise(hotel_cancellations, min = min(canc_time), max = max(canc_time), mean = mean(canc_time), median = median(canc_time))
# A tibble: 1 × 4
min max mean median
<drtn> <drtn> <drtn> <drtn>
1 -584 days 0 days -58.9916 days -26 days
Here, we filters the hotel_mutate
data frame to only include rows where the reservation_status
column is equal to ‘Canceled’. Then, It adds a new column called canc_time
to the hotel_cancellations
data frame, which is calculated by subtracting the reservation_status_date
column from the booking_date
column and calculates summary statistics for the canc_time
column, including the minimum, maximum, mean, and median values.
[1] TRUE
First, the across
function is used to replace all occurrences of the string "NULL"
with NA
in the agent
and company
columns.
Then, the mutate_at
function is used to convert the agent
and company
columns to numeric data type.
Finally, the is.numeric
function is used to check if the agent
column has been successfully converted to numeric data type.
Now, we will check did it got change to numeric or not.
Let’s try a new data set which is debt in trillion
# A tibble: 6 × 8
`Year and Quarter` Mortgage `HE Revolving` Auto …¹ Credi…² Stude…³ Other Total
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 03:Q1 4.94 0.242 0.641 0.688 0.241 0.478 7.23
2 03:Q2 5.08 0.26 0.622 0.693 0.243 0.486 7.38
3 03:Q3 5.18 0.269 0.684 0.693 0.249 0.477 7.56
4 03:Q4 5.66 0.302 0.704 0.698 0.253 0.449 8.07
5 04:Q1 5.84 0.328 0.72 0.695 0.260 0.446 8.29
6 04:Q2 5.97 0.367 0.743 0.697 0.263 0.423 8.46
# … with abbreviated variable names ¹`Auto Loan`, ²`Credit Card`,
# ³`Student Loan`
Data shows the amount of debt in mortgage, HE revolving, auto loan, credit card, student loan, other, and total for each quarter from 2003 - 2021.
tibble [74 × 8] (S3: tbl_df/tbl/data.frame)
$ Year and Quarter: chr [1:74] "03:Q1" "03:Q2" "03:Q3" "03:Q4" ...
$ Mortgage : num [1:74] 4.94 5.08 5.18 5.66 5.84 ...
$ HE Revolving : num [1:74] 0.242 0.26 0.269 0.302 0.328 0.367 0.426 0.468 0.502 0.528 ...
$ Auto Loan : num [1:74] 0.641 0.622 0.684 0.704 0.72 0.743 0.751 0.728 0.725 0.774 ...
$ Credit Card : num [1:74] 0.688 0.693 0.693 0.698 0.695 0.697 0.706 0.717 0.71 0.717 ...
$ Student Loan : num [1:74] 0.241 0.243 0.249 0.253 0.26 ...
$ Other : num [1:74] 0.478 0.486 0.477 0.449 0.447 ...
$ Total : num [1:74] 7.23 7.38 7.56 8.07 8.29 ...
Now we will try to create a column with only year and quarter, basically we will separate it.
Now we will create a data which is calculated by year
# A tibble: 6 × 7
year Mortgage `HE Revolving` `Auto Loan` `Credit Card` `Student Loan` Other
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5.22 0.268 0.663 0.693 0.246 0.472
2 4 6.09 0.397 0.736 0.704 0.300 0.426
3 5 6.80 0.534 0.780 0.724 0.377 0.404
4 6 7.87 0.595 0.806 0.746 0.450 0.422
5 7 8.79 0.626 0.808 0.804 0.524 0.412
6 8 9.26 0.685 0.804 0.853 0.604 0.410
# A tibble: 6 × 3
year mean_mortgage mean_other
<int> <dbl> <dbl>
1 3 5.22 0.472
2 4 6.09 0.426
3 5 6.80 0.404
4 6 7.87 0.422
5 7 8.79 0.412
6 8 9.26 0.410
This approach allows for more explicit column names in the resulting data frame.
---
title: "Challenge 4 Instructions"
author: "Meredith Rolfe"
desription: "More data wrangling: pivoting"
date: "08/18/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- abc_poll
- eggs
- fed_rates
- hotel_bookings
- debt
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- abc_poll.csv ⭐
- poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
- FedFundsRate.csv⭐⭐⭐
- hotel_bookings.csv⭐⭐⭐⭐
- debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
## Hotels booking data- (hotel_bookings.csv)
```{r}
hotel <- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)
hotel
```
### Briefly describe the data
## Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
Let's use lubridate package, so we can check unique values in all the columns.
from the given result- 119,390 × 32, we can say dataset has 119390 rows and 31 columns so, each row in the data set belongs to the hotel bookings done by each customer and it describe- type, reservation status, arrival date, no. of the adult or children in the room, room type, customer, stay days, meal, previous cancellations.
```{r}
rapply(hotel,function(x)length(unique(x)))
```
Any additional comments?
## Identify variables that need to be mutated
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
```{r}
table(hotel$country)
```
We can see lot of NULL value in the country column, so we should remove this because it won't help us.
```{r}
hotel <- hotel %>%
filter(!(country == "NULL"))
```
To see the class for each columns let's do,
```{r}
sapply(hotel, class)
```
Now we can mutate the arrival date into one single field and after that we can also mutate the adults and children so we can get the total guests in the hotel and then we will see the range of the arrival date of the bookings.
```{r}
hotel_mutate <- hotel %>%
mutate(arrival_date = str_c(arrival_date_day_of_month,
arrival_date_month,
arrival_date_year, sep="/"),
arrival_date = dmy(arrival_date),
total_guests = adults + children + babies) %>%
select(-c(arrival_date_day_of_month,arrival_date_month,arrival_date_year))
hotel_mutate
```
```{r}
summary(hotel_mutate$arrival_date)
```
From the above information we can see that arrival dates now lie between 2015-07-01 and 2017-08-31.
```{r}
hotel_mutate <- hotel_mutate %>%
mutate(booking_date = arrival_date - lead_time)
hotel_mutate
```
Now it adds a new column to the **`hotel_mutate`** data frame called **`booking_date`**, which is calculated by subtracting the **`lead_time`** column from the **`arrival_date`** column. This new column represents the date on which the hotel reservation was booked. After adding the new **`booking_date`** column, it returns the updated **`hotel_mutate`** data frame.
```{r}
unique(hotels_booking$reservation_status)
```
Now we will find that when the status was changed to cancelled after booking and also after that we will Mutate the class of the agent and company field to numeric.\
```{r}
hotel_cancellations <- filter(hotel_mutate, reservation_status == 'Canceled')
hotel_cancellations <- mutate(hotel_cancellations, canc_time = booking_date - reservation_status_date)
summarise(hotel_cancellations, min = min(canc_time), max = max(canc_time), mean = mean(canc_time), median = median(canc_time))
```
Here, we filters the **`hotel_mutate`** data frame to only include rows where the **`reservation_status`** column is equal to 'Canceled'. Then, It adds a new column called **`canc_time`** to the **`hotel_cancellations`** data frame, which is calculated by subtracting the **`reservation_status_date`** column from the **`booking_date`** column and calculates summary statistics for the **`canc_time`** column, including the minimum, maximum, mean, and median values.
```{r}
hotel_mutate <- hotel_mutate %>%
mutate(across(c(agent, company), ~ ifelse(str_detect(., "NULL"), NA, .))) %>%
mutate_at(vars(agent, company), as.numeric)
is.numeric(hotel_mutate$agent)
```
First, the **`across`** function is used to replace all occurrences of the string **`"NULL"`** with **`NA`** in the **`agent`** and **`company`** columns.
Then, the **`mutate_at`** function is used to convert the **`agent`** and **`company`** columns to numeric data type.
Finally, the **`is.numeric`** function is used to check if the **`agent`** column has been successfully converted to numeric data type.
Now, we will check did it got change to numeric or not.
```{r}
is.numeric(hotel_mutate$company)
```
## Debt in trillions data- (debt_in_trillions.xlsx)
```{r}
library(stringr)
library(tidyr)
library(dplyr)
library(ggplot2)
```
Let's try a new data set which is debt in trillion
```{r}
library(readxl)
data <- read_excel("_data/debt_in_trillions.xlsx")
head(data)
```
Data shows the amount of debt in mortgage, HE revolving, auto loan, credit card, student loan, other, and total for each quarter from 2003 - 2021.
```{r}
str(data)
```
Now we will try to create a column with only year and quarter, basically we will separate it.
```{r}
newdata <- data %>%
separate(`Year and Quarter`, c("year", "quarter"))
newdata$year <- as.integer(newdata$year)
```
Now we will create a data which is calculated by year
```{r}
yeardata <- newdata %>%
group_by(year) %>%
summarise_at (vars(Mortgage:Other), list(mean))
head(yeardata)
```
```{r}
yeardata <- newdata %>%
group_by(year) %>%
summarise(mean_mortgage = mean(Mortgage), mean_other = mean(Other))
head(yeardata)
```
This approach allows for more explicit column names in the resulting data frame.