DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 4

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Read in data
    • Briefly describe the data
  • Tidy Data (as needed)
  • Identify variables that need to be mutated

Challenge 4

  • Show All Code
  • Hide All Code

  • View Source
challenge_4
abc_poll
eggs
fed_rates
hotel_bookings
debt
Author

Neeharika Karanam

Published

December 2, 2022

Code
library(tidyverse)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

  • hotel_bookings.csv⭐⭐⭐⭐
Code
# Reading in the hotels booking data
hotels_booking <- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)

hotels_booking
# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

In order to check the different unique values in all of the columns I have used the lubridate package.

Code
rapply(hotels_booking,function(x)length(unique(x)))
                         hotel                    is_canceled 
                             2                              2 
                     lead_time              arrival_date_year 
                           479                              3 
            arrival_date_month       arrival_date_week_number 
                            12                             53 
     arrival_date_day_of_month        stays_in_weekend_nights 
                            31                             17 
          stays_in_week_nights                         adults 
                            35                             14 
                      children                         babies 
                             6                              5 
                          meal                        country 
                             5                            178 
                market_segment           distribution_channel 
                             8                              5 
             is_repeated_guest         previous_cancellations 
                             2                             15 
previous_bookings_not_canceled             reserved_room_type 
                            73                             10 
            assigned_room_type                booking_changes 
                            12                             21 
                  deposit_type                          agent 
                             3                            334 
                       company           days_in_waiting_list 
                           353                            128 
                 customer_type                            adr 
                             4                           8879 
   required_car_parking_spaces      total_of_special_requests 
                             5                              6 
            reservation_status        reservation_status_date 
                             3                            926 

Briefly describe the data

From the above dataset I have observed that the hotel bookings dataset has 119390 rows and 31 columns also consisting of data from mainly two different hotels called the City Hotel and the Resort Hotel. Each row in the dataset corresponds to the hotel bookings made by each of the customer and it also has a lot of additional information like the arrival date, stay days, meal, type of the room, customer, type, reservation status, number of the adult/children/babies in the room, previous cancellations, if there is a car parking space and lots more information and this data is from different parts of the world.

Tidy Data (as needed)

I would like to tidy the data in the following way.

Code
table(hotels_booking$country)

  ABW   AGO   AIA   ALB   AND   ARE   ARG   ARM   ASM   ATA   ATF   AUS   AUT 
    2   362     1    12     7    51   214     8     1     2     1   426  1263 
  AZE   BDI   BEL   BEN   BFA   BGD   BGR   BHR   BHS   BIH   BLR   BOL   BRA 
   17     1  2342     3     1    12    75     5     1    13    26    10  2224 
  BRB   BWA   CAF   CHE   CHL   CHN   CIV   CMR    CN   COL   COM   CPV   CRI 
    4     1     5  1730    65   999     6    10  1279    71     2    24    19 
  CUB   CYM   CYP   CZE   DEU   DJI   DMA   DNK   DOM   DZA   ECU   EGY   ESP 
    8     1    51   171  7287     1     1   435    14   103    27    32  8568 
  EST   ETH   FIN   FJI   FRA   FRO   GAB   GBR   GEO   GGY   GHA   GIB   GLP 
   83     3   447     1 10415     5     4 12129    22     3     4    18     2 
  GNB   GRC   GTM   GUY   HKG   HND   HRV   HUN   IDN   IMN   IND   IRL   IRN 
    9   128     4     1    29     1   100   230    35     2   152  3375    83 
  IRQ   ISL   ISR   ITA   JAM   JEY   JOR   JPN   KAZ   KEN   KHM   KIR   KNA 
   14    57   669  3766     6     8    21   197    19     6     2     1     2 
  KOR   KWT   LAO   LBN   LBY   LCA   LIE   LKA   LTU   LUX   LVA   MAC   MAR 
  133    16     2    31     8     1     3     7    81   287    55    16   259 
  MCO   MDG   MDV   MEX   MKD   MLI   MLT   MMR   MNE   MOZ   MRT   MUS   MWI 
    4     1    12    85    10     1    18     1     5    67     1     7     2 
  MYS   MYT   NAM   NCL   NGA   NIC   NLD   NOR   NPL  NULL   NZL   OMN   PAK 
   28     2     1     1    34     1  2104   607     1   488    74    18    14 
  PAN   PER   PHL   PLW   POL   PRI   PRT   PRY   PYF   QAT   ROU   RUS   RWA 
    9    29    40     1   919    12 48590     4     1    15   500   632     2 
  SAU   SDN   SEN   SGP   SLE   SLV   SMR   SRB   STP   SUR   SVK   SVN   SWE 
   48     1    11    39     1     2     1   101     2     5    65    57  1024 
  SYC   SYR   TGO   THA   TJK   TMP   TUN   TUR   TWN   TZA   UGA   UKR   UMI 
    2     3     2    59     9     3    39   248    51     5     2    68     1 
  URY   USA   UZB   VEN   VGB   VNM   ZAF   ZMB   ZWE 
   32  2097     4    26     1     8    80     2     4 

From the above we can observe that there is a NULL value in the country column and we can remove this value because this information will not help us in futher analysis.

Code
hotels_booking <- hotels_booking %>% 
  filter(!(country == "NULL"))
Code
#To understand the class for each of the columns in the dataset.
sapply(hotels_booking, class)
                         hotel                    is_canceled 
                   "character"                      "numeric" 
                     lead_time              arrival_date_year 
                     "numeric"                      "numeric" 
            arrival_date_month       arrival_date_week_number 
                   "character"                      "numeric" 
     arrival_date_day_of_month        stays_in_weekend_nights 
                     "numeric"                      "numeric" 
          stays_in_week_nights                         adults 
                     "numeric"                      "numeric" 
                      children                         babies 
                     "numeric"                      "numeric" 
                          meal                        country 
                   "character"                    "character" 
                market_segment           distribution_channel 
                   "character"                    "character" 
             is_repeated_guest         previous_cancellations 
                     "numeric"                      "numeric" 
previous_bookings_not_canceled             reserved_room_type 
                     "numeric"                    "character" 
            assigned_room_type                booking_changes 
                   "character"                      "numeric" 
                  deposit_type                          agent 
                   "character"                    "character" 
                       company           days_in_waiting_list 
                   "character"                      "numeric" 
                 customer_type                            adr 
                   "character"                      "numeric" 
   required_car_parking_spaces      total_of_special_requests 
                     "numeric"                      "numeric" 
            reservation_status        reservation_status_date 
                   "character"                         "Date" 

We can now understand that there are 2 different fields called the agents and the company which have numerical values in them but has their class described as character. Therefore, these NULL values can be modified from NULL to NA and we can change the type of the column to numeric. We can also combine the arrival date in year, month and the date into one single field and name it as the arrival date.

Identify variables that need to be mutated

I have identified the various variables that can be mutated and have performed the mutation.

Code
#Mutating the arrival date into a single field and also mutate the adults, babies and children in order to get the total guests in the hotel.

hotels_booking_mutate <- hotels_booking %>% 
  mutate(arrival_date = str_c(arrival_date_day_of_month,
                              arrival_date_month,
                              arrival_date_year, sep="/"),
         arrival_date = dmy(arrival_date),
         total_guests = adults + children + babies) %>% 
  select(-c(arrival_date_day_of_month,arrival_date_month,arrival_date_year))

hotels_booking_mutate
# A tibble: 118,902 × 31
   hotel     is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal 
   <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>
 1 Resort H…       0     342      27       0       0      2       0      0 BB   
 2 Resort H…       0     737      27       0       0      2       0      0 BB   
 3 Resort H…       0       7      27       0       1      1       0      0 BB   
 4 Resort H…       0      13      27       0       1      1       0      0 BB   
 5 Resort H…       0      14      27       0       2      2       0      0 BB   
 6 Resort H…       0      14      27       0       2      2       0      0 BB   
 7 Resort H…       0       0      27       0       2      2       0      0 BB   
 8 Resort H…       0       9      27       0       2      2       0      0 FB   
 9 Resort H…       1      85      27       0       3      2       0      0 BB   
10 Resort H…       1      75      27       0       3      2       0      0 HB   
# … with 118,892 more rows, 21 more variables: country <chr>,
#   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …

After the mutation now lets find the range of the arrival date of the different bookings of the data

Code
summary(hotels_booking_mutate$arrival_date)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2015-07-01" "2016-03-14" "2016-09-07" "2016-08-29" "2017-03-19" "2017-08-31" 

We can now observe that the arrival dates now lie between the July 2015 - August 2017. Also, the lead time will help us in understanding when the total hotel reservation has been started and we can also calculate the booking date by subtracting the lead time from the arrival date.

Code
hotels_booking_mutate <- hotels_booking_mutate %>%
  mutate(booking_date = arrival_date - lead_time)

hotels_booking_mutate
# A tibble: 118,902 × 32
   hotel     is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal 
   <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>
 1 Resort H…       0     342      27       0       0      2       0      0 BB   
 2 Resort H…       0     737      27       0       0      2       0      0 BB   
 3 Resort H…       0       7      27       0       1      1       0      0 BB   
 4 Resort H…       0      13      27       0       1      1       0      0 BB   
 5 Resort H…       0      14      27       0       2      2       0      0 BB   
 6 Resort H…       0      14      27       0       2      2       0      0 BB   
 7 Resort H…       0       0      27       0       2      2       0      0 BB   
 8 Resort H…       0       9      27       0       2      2       0      0 FB   
 9 Resort H…       1      85      27       0       3      2       0      0 BB   
10 Resort H…       1      75      27       0       3      2       0      0 HB   
# … with 118,892 more rows, 22 more variables: country <chr>,
#   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …
Code
unique(hotels_booking$reservation_status)
[1] "Check-Out" "Canceled"  "No-Show"  
Code
#We can find the number of days after booking when the status was changed to cancelled
hotel_booking_cancellation <- hotels_booking_mutate %>%
  filter(reservation_status == 'Canceled') %>%
  mutate(canc_time = booking_date - reservation_status_date)

hotel_booking_cancellation %>%
  summarise(min = min(canc_time), max = max(canc_time), mean = mean(canc_time), median = median(canc_time))
# A tibble: 1 × 4
  min       max    mean          median  
  <drtn>    <drtn> <drtn>        <drtn>  
1 -584 days 0 days -58.9916 days -26 days

Now we can change the class of the column of the agent and the company from character to numeric.

Code
#Mutating the class of the agent and company field from character to numeric

hotels_booking_mutate <- hotels_booking_mutate %>%
  mutate(across(c(agent, company),~ replace(.,str_detect(., "NULL"), NA))) %>% mutate_at(vars(agent, company),as.numeric)

is.numeric(hotels_booking_mutate$agent)
[1] TRUE
Code
is.numeric(hotels_booking_mutate$company)
[1] TRUE
Source Code
---
title: "Challenge 4"
author: "Neeharika Karanam"
desription: "More data wrangling: pivoting"
date: "12/02/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_4
  - abc_poll
  - eggs
  - fed_rates
  - hotel_bookings
  - debt
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Read in data

-   hotel_bookings.csv⭐⭐⭐⭐


```{r}
# Reading in the hotels booking data
hotels_booking <- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)

hotels_booking
```

In order to check the different unique values in all of the columns I have used the lubridate package.

```{r}
rapply(hotels_booking,function(x)length(unique(x)))
```

### Briefly describe the data

From the above dataset I have observed that the hotel bookings dataset has 119390 rows and 31 columns also consisting of data from mainly two different hotels called the City Hotel and the Resort Hotel. Each row in the dataset corresponds to the hotel bookings made by each of the customer and it also has a lot of additional information like the arrival date, stay days, meal, type of the room, customer, type, reservation status, number of the adult/children/babies in the room, previous cancellations, if there is a car parking space and lots more information and this data is from different parts of the world.

## Tidy Data (as needed)

I would like to tidy the data in the following way.

```{r}
table(hotels_booking$country)
```

From the above we can observe that there is a NULL value in the country column and we can remove this value because this information will not help us in futher analysis.

```{r}
hotels_booking <- hotels_booking %>% 
  filter(!(country == "NULL"))

```

```{r}
#To understand the class for each of the columns in the dataset.
sapply(hotels_booking, class)
```

We can now understand that there are 2 different fields called the agents and the company which have numerical values in them but has their class described as character. Therefore, these NULL values can be modified from NULL to NA and we can change the type of the column to numeric. We can also combine the arrival date in year, month and the date into one single field and name it as the arrival date.

## Identify variables that need to be mutated

I have identified the various variables that can be mutated and have performed the mutation.

```{r}
#Mutating the arrival date into a single field and also mutate the adults, babies and children in order to get the total guests in the hotel.

hotels_booking_mutate <- hotels_booking %>% 
  mutate(arrival_date = str_c(arrival_date_day_of_month,
                              arrival_date_month,
                              arrival_date_year, sep="/"),
         arrival_date = dmy(arrival_date),
         total_guests = adults + children + babies) %>% 
  select(-c(arrival_date_day_of_month,arrival_date_month,arrival_date_year))

hotels_booking_mutate
```
After the mutation now lets find the range of the arrival date of the different bookings of the data

```{r}
summary(hotels_booking_mutate$arrival_date)
```

We can now observe that the arrival dates now lie between the July 2015 - August 2017. Also, the lead time will help us in understanding when the total hotel reservation has been started and we can also calculate the booking date by subtracting the lead time from the arrival date.

```{r}
hotels_booking_mutate <- hotels_booking_mutate %>%
  mutate(booking_date = arrival_date - lead_time)

hotels_booking_mutate
```
```{r}
unique(hotels_booking$reservation_status)
```

```{r}
#We can find the number of days after booking when the status was changed to cancelled
hotel_booking_cancellation <- hotels_booking_mutate %>%
  filter(reservation_status == 'Canceled') %>%
  mutate(canc_time = booking_date - reservation_status_date)

hotel_booking_cancellation %>%
  summarise(min = min(canc_time), max = max(canc_time), mean = mean(canc_time), median = median(canc_time))
```

Now we can change the class of the column of the agent and the company from character to numeric.

```{r}
#Mutating the class of the agent and company field from character to numeric

hotels_booking_mutate <- hotels_booking_mutate %>%
  mutate(across(c(agent, company),~ replace(.,str_detect(., "NULL"), NA))) %>% mutate_at(vars(agent, company),as.numeric)

is.numeric(hotels_booking_mutate$agent)
```

```{r}
is.numeric(hotels_booking_mutate$company)
```