DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 4 Instructions

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in data
    • Briefly describe the data
  • Tidy Data (as needed)
  • Identify variables that need to be mutated

Challenge 4 Instructions

  • Show All Code
  • Hide All Code

  • View Source
challenge_4
abc_poll
eggs
fed_rates
hotel_bookings
debt
Author

Matthew Norberg

Published

September 29, 2022

Code
library(tidyverse)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
Code
# Read in data set and display values
df <- read_csv('./_data/hotel_bookings.csv')
df
# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Code
# Print a summary of the data set to help describe it
summarytools::dfSummary(df)
Data Frame Summary  
df  
Dimensions: 119390 x 32  
Duplicates: 31994  

-----------------------------------------------------------------------------------------------------------------------------------
No   Variable                         Stats / Values             Freqs (% of Valid)     Graph                  Valid      Missing  
---- -------------------------------- -------------------------- ---------------------- ---------------------- ---------- ---------
1    hotel                            1. City Hotel              79330 (66.4%)          IIIIIIIIIIIII          119390     0        
     [character]                      2. Resort Hotel            40060 (33.6%)          IIIIII                 (100.0%)   (0.0%)   

2    is_canceled                      Min  : 0                   0 : 75166 (63.0%)      IIIIIIIIIIII           119390     0        
     [numeric]                        Mean : 0.4                 1 : 44224 (37.0%)      IIIIIII                (100.0%)   (0.0%)   
                                      Max  : 1                                                                                     

3    lead_time                        Mean (sd) : 104 (106.9)    479 distinct values    :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 69 < 737                                      :                                          
                                      IQR (CV) : 142 (1)                                : : .                                      
                                                                                        : : : . .                                  

4    arrival_date_year                Mean (sd) : 2016.2 (0.7)   2015 : 21996 (18.4%)   III                    119390     0        
     [numeric]                        min < med < max:           2016 : 56707 (47.5%)   IIIIIIIII              (100.0%)   (0.0%)   
                                      2015 < 2016 < 2017         2017 : 40687 (34.1%)   IIIIII                                     
                                      IQR (CV) : 1 (0)                                                                             

5    arrival_date_month               1. August                  13877 (11.6%)          II                     119390     0        
     [character]                      2. July                    12661 (10.6%)          II                     (100.0%)   (0.0%)   
                                      3. May                     11791 ( 9.9%)          I                                          
                                      4. October                 11160 ( 9.3%)          I                                          
                                      5. April                   11089 ( 9.3%)          I                                          
                                      6. June                    10939 ( 9.2%)          I                                          
                                      7. September               10508 ( 8.8%)          I                                          
                                      8. March                    9794 ( 8.2%)          I                                          
                                      9. February                 8068 ( 6.8%)          I                                          
                                      10. November                6794 ( 5.7%)          I                                          
                                      [ 2 others ]               12709 (10.6%)          II                                         

6    arrival_date_week_number         Mean (sd) : 27.2 (13.6)    53 distinct values           . : . . .        119390     0        
     [numeric]                        min < med < max:                                    . : : : : : :        (100.0%)   (0.0%)   
                                      1 < 28 < 53                                       . : : : : : : : : :                        
                                      IQR (CV) : 22 (0.5)                               : : : : : : : : : :                        
                                                                                        : : : : : : : : : :                        

7    arrival_date_day_of_month        Mean (sd) : 15.8 (8.8)     31 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  : : : . : : . : :      (100.0%)   (0.0%)   
                                      1 < 16 < 31                                       : : : : : : : : : :                        
                                      IQR (CV) : 15 (0.6)                               : : : : : : : : : :                        
                                                                                        : : : : : : : : : :                        

8    stays_in_weekend_nights          Mean (sd) : 0.9 (1)        17 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 1 < 19                                        :                                          
                                      IQR (CV) : 2 (1.1)                                : :                                        
                                                                                        : :                                        

9    stays_in_week_nights             Mean (sd) : 2.5 (1.9)      35 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 2 < 50                                        :                                          
                                      IQR (CV) : 2 (0.8)                                :                                          
                                                                                        :                                          

10   adults                           Mean (sd) : 1.9 (0.6)      14 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 2 < 55                                        :                                          
                                      IQR (CV) : 0 (0.3)                                :                                          
                                                                                        :                                          

11   children                         Mean (sd) : 0.1 (0.4)      0 : 110796 (92.8%)     IIIIIIIIIIIIIIIIII     119386     4        
     [numeric]                        min < med < max:           1 :   4861 ( 4.1%)                            (100.0%)   (0.0%)   
                                      0 < 0 < 10                 2 :   3652 ( 3.1%)                                                
                                      IQR (CV) : 0 (3.8)         3 :     76 ( 0.1%)                                                
                                                                 10 :      1 ( 0.0%)                                               

12   babies                           Mean (sd) : 0 (0.1)        0 : 118473 (99.2%)     IIIIIIIIIIIIIIIIIII    119390     0        
     [numeric]                        min < med < max:           1 :    900 ( 0.8%)                            (100.0%)   (0.0%)   
                                      0 < 0 < 10                 2 :     15 ( 0.0%)                                                
                                      IQR (CV) : 0 (12.3)        9 :      1 ( 0.0%)                                                
                                                                 10 :      1 ( 0.0%)                                               

13   meal                             1. BB                      92310 (77.3%)          IIIIIIIIIIIIIII        119390     0        
     [character]                      2. FB                        798 ( 0.7%)                                 (100.0%)   (0.0%)   
                                      3. HB                      14463 (12.1%)          II                                         
                                      4. SC                      10650 ( 8.9%)          I                                          
                                      5. Undefined                1169 ( 1.0%)                                                     

14   country                          1. PRT                     48590 (40.7%)          IIIIIIII               119390     0        
     [character]                      2. GBR                     12129 (10.2%)          II                     (100.0%)   (0.0%)   
                                      3. FRA                     10415 ( 8.7%)          I                                          
                                      4. ESP                      8568 ( 7.2%)          I                                          
                                      5. DEU                      7287 ( 6.1%)          I                                          
                                      6. ITA                      3766 ( 3.2%)                                                     
                                      7. IRL                      3375 ( 2.8%)                                                     
                                      8. BEL                      2342 ( 2.0%)                                                     
                                      9. BRA                      2224 ( 1.9%)                                                     
                                      10. NLD                     2104 ( 1.8%)                                                     
                                      [ 168 others ]             18590 (15.6%)          III                                        

15   market_segment                   1. Aviation                  237 ( 0.2%)                                 119390     0        
     [character]                      2. Complementary             743 ( 0.6%)                                 (100.0%)   (0.0%)   
                                      3. Corporate                5295 ( 4.4%)                                                     
                                      4. Direct                  12606 (10.6%)          II                                         
                                      5. Groups                  19811 (16.6%)          III                                        
                                      6. Offline TA/TO           24219 (20.3%)          IIII                                       
                                      7. Online TA               56477 (47.3%)          IIIIIIIII                                  
                                      8. Undefined                   2 ( 0.0%)                                                     

16   distribution_channel             1. Corporate                6677 ( 5.6%)          I                      119390     0        
     [character]                      2. Direct                  14645 (12.3%)          II                     (100.0%)   (0.0%)   
                                      3. GDS                       193 ( 0.2%)                                                     
                                      4. TA/TO                   97870 (82.0%)          IIIIIIIIIIIIIIII                           
                                      5. Undefined                   5 ( 0.0%)                                                     

17   is_repeated_guest                Min  : 0                   0 : 115580 (96.8%)     IIIIIIIIIIIIIIIIIII    119390     0        
     [numeric]                        Mean : 0                   1 :   3810 ( 3.2%)                            (100.0%)   (0.0%)   
                                      Max  : 1                                                                                     

18   previous_cancellations           Mean (sd) : 0.1 (0.8)      15 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 26                                        :                                          
                                      IQR (CV) : 0 (9.7)                                :                                          
                                                                                        :                                          

19   previous_bookings_not_canceled   Mean (sd) : 0.1 (1.5)      73 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 72                                        :                                          
                                      IQR (CV) : 0 (10.9)                               :                                          
                                                                                        :                                          

20   reserved_room_type               1. A                       85994 (72.0%)          IIIIIIIIIIIIII         119390     0        
     [character]                      2. B                        1118 ( 0.9%)                                 (100.0%)   (0.0%)   
                                      3. C                         932 ( 0.8%)                                                     
                                      4. D                       19201 (16.1%)          III                                        
                                      5. E                        6535 ( 5.5%)          I                                          
                                      6. F                        2897 ( 2.4%)                                                     
                                      7. G                        2094 ( 1.8%)                                                     
                                      8. H                         601 ( 0.5%)                                                     
                                      9. L                           6 ( 0.0%)                                                     
                                      10. P                         12 ( 0.0%)                                                     

21   assigned_room_type               1. A                       74053 (62.0%)          IIIIIIIIIIII           119390     0        
     [character]                      2. D                       25322 (21.2%)          IIII                   (100.0%)   (0.0%)   
                                      3. E                        7806 ( 6.5%)          I                                          
                                      4. F                        3751 ( 3.1%)                                                     
                                      5. G                        2553 ( 2.1%)                                                     
                                      6. C                        2375 ( 2.0%)                                                     
                                      7. B                        2163 ( 1.8%)                                                     
                                      8. H                         712 ( 0.6%)                                                     
                                      9. I                         363 ( 0.3%)                                                     
                                      10. K                        279 ( 0.2%)                                                     
                                      [ 2 others ]                  13 ( 0.0%)                                                     

22   booking_changes                  Mean (sd) : 0.2 (0.7)      21 distinct values     :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 21                                        :                                          
                                      IQR (CV) : 0 (2.9)                                :                                          
                                                                                        :                                          

23   deposit_type                     1. No Deposit              104641 (87.6%)         IIIIIIIIIIIIIIIII      119390     0        
     [character]                      2. Non Refund               14587 (12.2%)         II                     (100.0%)   (0.0%)   
                                      3. Refundable                 162 ( 0.1%)                                                    

24   agent                            1. 9                       31961 (26.8%)          IIIII                  119390     0        
     [character]                      2. NULL                    16340 (13.7%)          II                     (100.0%)   (0.0%)   
                                      3. 240                     13922 (11.7%)          II                                         
                                      4. 1                        7191 ( 6.0%)          I                                          
                                      5. 14                       3640 ( 3.0%)                                                     
                                      6. 7                        3539 ( 3.0%)                                                     
                                      7. 6                        3290 ( 2.8%)                                                     
                                      8. 250                      2870 ( 2.4%)                                                     
                                      9. 241                      1721 ( 1.4%)                                                     
                                      10. 28                      1666 ( 1.4%)                                                     
                                      [ 324 others ]             33250 (27.8%)          IIIII                                      

25   company                          1. NULL                    112593 (94.3%)         IIIIIIIIIIIIIIIIII     119390     0        
     [character]                      2. 40                         927 ( 0.8%)                                (100.0%)   (0.0%)   
                                      3. 223                        784 ( 0.7%)                                                    
                                      4. 67                         267 ( 0.2%)                                                    
                                      5. 45                         250 ( 0.2%)                                                    
                                      6. 153                        215 ( 0.2%)                                                    
                                      7. 174                        149 ( 0.1%)                                                    
                                      8. 219                        141 ( 0.1%)                                                    
                                      9. 281                        138 ( 0.1%)                                                    
                                      10. 154                       133 ( 0.1%)                                                    
                                      [ 343 others ]               3793 ( 3.2%)                                                    

26   days_in_waiting_list             Mean (sd) : 2.3 (17.6)     128 distinct values    :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      0 < 0 < 391                                       :                                          
                                      IQR (CV) : 0 (7.6)                                :                                          
                                                                                        :                                          

27   customer_type                    1. Contract                 4076 ( 3.4%)                                 119390     0        
     [character]                      2. Group                     577 ( 0.5%)                                 (100.0%)   (0.0%)   
                                      3. Transient               89613 (75.1%)          IIIIIIIIIIIIIII                            
                                      4. Transient-Party         25124 (21.0%)          IIII                                       

28   adr                              Mean (sd) : 101.8 (50.5)   8879 distinct values   :                      119390     0        
     [numeric]                        min < med < max:                                  :                      (100.0%)   (0.0%)   
                                      -6.4 < 94.6 < 5400                                :                                          
                                      IQR (CV) : 56.7 (0.5)                             :                                          
                                                                                        :                                          

29   required_car_parking_spaces      Mean (sd) : 0.1 (0.2)      0 : 111974 (93.8%)     IIIIIIIIIIIIIIIIII     119390     0        
     [numeric]                        min < med < max:           1 :   7383 ( 6.2%)     I                      (100.0%)   (0.0%)   
                                      0 < 0 < 8                  2 :     28 ( 0.0%)                                                
                                      IQR (CV) : 0 (3.9)         3 :      3 ( 0.0%)                                                
                                                                 8 :      2 ( 0.0%)                                                

30   total_of_special_requests        Mean (sd) : 0.6 (0.8)      0 : 70318 (58.9%)      IIIIIIIIIII            119390     0        
     [numeric]                        min < med < max:           1 : 33226 (27.8%)      IIIII                  (100.0%)   (0.0%)   
                                      0 < 0 < 5                  2 : 12969 (10.9%)      II                                         
                                      IQR (CV) : 1 (1.4)         3 :  2497 ( 2.1%)                                                 
                                                                 4 :   340 ( 0.3%)                                                 
                                                                 5 :    40 ( 0.0%)                                                 

31   reservation_status               1. Canceled                43017 (36.0%)          IIIIIII                119390     0        
     [character]                      2. Check-Out               75166 (63.0%)          IIIIIIIIIIII           (100.0%)   (0.0%)   
                                      3. No-Show                  1207 ( 1.0%)                                                     

32   reservation_status_date          min : 2014-10-17           926 distinct values            . : : : :      119390     0        
     [Date]                           med : 2016-08-07                                        : : : : : : .    (100.0%)   (0.0%)   
                                      max : 2017-09-14                                      . : : : : : : :                        
                                      range : 2y 10m 28d                                    : : : : : : : :                        
                                                                                        .   : : : : : : : :                        
-----------------------------------------------------------------------------------------------------------------------------------

Briefly describe the data

Each row in the data set describes a hotel booking for a hotel chain. We can see that from the ‘hotel’ column that each hotel in the chain is classified as either a ‘City Hotel’ or a ‘Resort Hotel’. Looking at the ‘arrival_date_year’ column, we can see that the data was collected between 2015 and 2017. Additionally, we can see from the ‘country’ column that the data was collected from all over the world. In total, we can see that there are 178 different values in the ‘country’ column. However, one of those values is “NULL”, so there are only 177 known countries included in the data set.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

There are four rows in the data set that contain values that are not available. They are all in the ‘children’ column. I believe it is safe to remove these values since it is only 4 cases out of 119,390.

Code
# Remove NA values
df <- na.omit(df)

The data is tidy after removing the values which are not available. Every row represents an observation, each column represents one variable, and each value has its own cell.

Any additional comments?

Looking at the summary of the data frame, we can see that the ‘agent’ and ‘company’ columns are of type character. However, upon inspection, it appears as though they are all integers. We can inspect the two values in the two columns with the following R commands:

Code
# List distinct values in agent and company
distinct(df, agent)
# A tibble: 334 × 1
   agent
   <chr>
 1 NULL 
 2 304  
 3 240  
 4 303  
 5 15   
 6 241  
 7 8    
 8 250  
 9 115  
10 5    
# … with 324 more rows
Code
distinct(df, company)
# A tibble: 353 × 1
   company
   <chr>  
 1 NULL   
 2 110    
 3 113    
 4 270    
 5 178    
 6 240    
 7 154    
 8 144    
 9 307    
10 268    
# … with 343 more rows

By inspection, we can see that the columns are of character type because of string “NULL” in each column. This was likely placed in the data set to indicate that an agent or company was not used. However, it is a bit annoying that the columns are character types. For instance, this prevents you from sorting the columns by value. Here is an example:

Code
# Sort distinct values in agent column
arrange(distinct(df, agent), agent)
# A tibble: 334 × 1
   agent
   <chr>
 1 1    
 2 10   
 3 103  
 4 104  
 5 105  
 6 106  
 7 107  
 8 11   
 9 110  
10 111  
# … with 324 more rows

We can see in the tibble above, that 107 comes before 11. That isn’t quite right and it is only happening because the column is of type character. If we look at the column a bit further, we can see that none of the values are less than 0. These queries confirm that this is the case.

Code
# Find agent and company values that have integer value less than 0
distinct(df, agent) %>% filter(agent != "NULL") %>% filter(as.integer(agent) <= 0)
# A tibble: 0 × 1
# … with 1 variable: agent <chr>
Code
distinct(df, company) %>% filter(company != "NULL") %>% filter(as.integer(company) <= 0)
# A tibble: 0 × 1
# … with 1 variable: company <chr>

To change the type of the column from character to double, I think we should replace all of the “NULL” strings with “-1”. Then we can convert the type of the column. We will just use the convention that a value of -1 in either column indicates that an agent or company was not used.

Code
# Convert agent and company columns to double type
# Note: Replace "NULL" with "-1" first to do so
df <- df %>%
      mutate(agent = str_replace(agent, "NULL", "-1")) %>%
      mutate(company = str_replace(company, "NULL", "-1")) %>%
      type_convert()

Now, the ‘agent’ and ‘company’ columns have been converted to double.

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

First, I think that the ‘arrival_date_week_number’ is not useful. We have the arrival month, year, and the day of the month. I am not sure why we need the ‘arrival_date_week_number’ as well. Also, certain dates do not always fall on the same date of the year. For example, this year, January 1st was on a Friday and last year it was on a Saturday. Consequently, the week of the year that someone arrives may not fall on the same date between years. Lastly, I am not sure how they determine what the first week of the year is. In the previous example, is the first week the span of Friday, Saturday, and Sunday because the first is on a Friday? I don’t know, but the first week could also be considered as the following week in the example because it is the first full week. Due to these details, I think it’s best to drop that column from the data set.

Code
# Remove arrival_date_week_number
df <- df %>% select(-(arrival_date_week_number))

The next observation I had is that the ‘adr’ column stands for average daily rate. I think it would be good to rename this column to make the data more readable.

Code
# Rename adr column
df <- df %>% rename(average_daily_rate = adr)

Looking at the columns, it appears as though many of the columns represent categorical data. These can all be converted to factors to make it easier to understand how many times each category appears in the data set.

Code
# Create factors for categorical data

month_levels <- c(
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December"
)
df$arrival_date_month <- factor(df$arrival_date_month, month_levels)

df$meal <- factor(df$meal, c("BB", "HB", "FB", "SC", "Undefined"))

df$hotel <- factor(df$hotel, c("City Hotel", "Resort Hotel"))

df$market_segment <- factor(df$market_segment, c("Aviation", "Complementary", "Corporate", "Direct", "Groups",
                                                 "Offline TA/TO", "Online TA"))

df$distribution_channel <- factor(df$distribution_channel, c("Corporate", "Direct", "GDS", "TA/TO", "Undefined"))

df$assigned_room_type <- factor(df$assigned_room_type, unique(df$assigned_room_type))

df$reserved_room_type <- factor(df$reserved_room_type, unique(df$reserved_room_type))

df$deposit_type <- factor(df$deposit_type, unique(df$deposit_type))

df$customer_type <- factor(df$customer_type, unique(df$customer_type))

df$reservation_status <- factor(df$reservation_status, unique(df$reservation_status))

The next observation is that we are recording the arrival date in three columns, instead of one. We can use mutate to combine the information in these columns. Afterwards, we can remove the old date columns to simplify the data set.

Code
# Convert arrival information into single column and remove old columns
df <- df %>% mutate(date = make_date(arrival_date_year, arrival_date_month, arrival_date_day_of_month))
df <- df %>% relocate(date, .after = hotel)
df <- df %>% select(-c(arrival_date_year, arrival_date_month, arrival_date_day_of_month))

Next, we can see that the number of nights that the booking was scheduled for is recorded in two columns. One records the weeknights and the other records the weekdays. I think it would be a good idea to create a new column called total nights using mutate. This is done below.

Code
# Create a column to store total number of nights stayed
df <- df %>% mutate(total_nights = stays_in_weekend_nights + stays_in_week_nights) %>%
      relocate(total_nights, .after = stays_in_week_nights)

Now that we have tidied up the tibble, we need to present it in a way that is readable. I think the best way to do that is to sort it by arrival date.

Code
# Sort by date
df <- df %>% arrange(date)

Any additional comments?

Source Code
---
title: "Challenge 4 Instructions"
author: "Matthew Norberg"
desription: "More data wrangling: pivoting"
date: "09/29/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_4
  - abc_poll
  - eggs
  - fed_rates
  - hotel_bookings
  - debt
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to:

1)  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2)  tidy data (as needed, including sanity checks)
3)  identify variables that need to be mutated
4)  mutate variables and sanity check all mutations

## Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

-   abc_poll.csv ⭐
-   poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
-   FedFundsRate.csv⭐⭐⭐
-   hotel_bookings.csv⭐⭐⭐⭐
-   debt_in_trillions.xlsx ⭐⭐⭐⭐⭐

```{r}
# Read in data set and display values
df <- read_csv('./_data/hotel_bookings.csv')
df
# Print a summary of the data set to help describe it
summarytools::dfSummary(df)
```

### Briefly describe the data

Each row in the data set describes a hotel booking for a hotel chain.  We can see that from the 'hotel' column that each hotel in the chain is classified as either a 'City Hotel' or a 'Resort Hotel'.  Looking at the 'arrival_date_year' column, we can see that the data was collected between 2015 and 2017.  Additionally, we can see from the 'country' column that the data was collected from all over the world.  In total, we can see that there are 178 different values in the 'country' column.  However, one of those values is "NULL", so there are only 177 known countries included in the data set. 

## Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

There are four rows in the data set that contain values that are not available.  They are all in the 'children' column.  I believe it is safe to remove these values since it is only 4 cases out of 119,390.

```{r}
# Remove NA values
df <- na.omit(df)
```

The data is tidy after removing the values which are not available.  Every row represents an observation, each column represents one variable, and each value has its own cell.

Any additional comments?

Looking at the summary of the data frame, we can see that the 'agent' and 'company' columns are of type character.  However, upon inspection, it appears as though they are all integers.  We can inspect the two values in the two columns with the following R commands:

```{r}
# List distinct values in agent and company
distinct(df, agent)
distinct(df, company)
```

By inspection, we can see that the columns are of character type because of string "NULL" in each column.  This was likely placed in the data set to indicate that an agent or company was not used.  However, it is a bit annoying that the columns are character types.  For instance, this prevents you from sorting the columns by value.  Here is an example:

```{r}
# Sort distinct values in agent column
arrange(distinct(df, agent), agent)
```

We can see in the tibble above, that 107 comes before 11.  That isn't quite right and it is only happening because the column is of type character.  If we look at the column a bit further, we can see that none of the values are less than 0.  These queries confirm that this is the case.

```{r}
# Find agent and company values that have integer value less than 0
distinct(df, agent) %>% filter(agent != "NULL") %>% filter(as.integer(agent) <= 0)
distinct(df, company) %>% filter(company != "NULL") %>% filter(as.integer(company) <= 0)
```

To change the type of the column from character to double, I think we should replace all of the "NULL" strings with "-1".  Then we can convert the type of the column.  We will just use the convention that a value of -1 in either column indicates that an agent or company was not used.  

```{r}
# Convert agent and company columns to double type
# Note: Replace "NULL" with "-1" first to do so
df <- df %>%
      mutate(agent = str_replace(agent, "NULL", "-1")) %>%
      mutate(company = str_replace(company, "NULL", "-1")) %>%
      type_convert()
```

Now, the 'agent' and 'company' columns have been converted to double.

## Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

First, I think that the 'arrival_date_week_number' is not useful.  We have the arrival month, year, and the day of the month.  I am not sure why we need the 'arrival_date_week_number' as well.  Also, certain dates do not always fall on the same date of the year.  For example, this year, January 1st was on a Friday and last year it was on a Saturday.  Consequently, the week of the year that someone arrives may not fall on the same date between years.  Lastly, I am not sure how they determine what the first week of the year is.  In the previous example, is the first week the span of Friday, Saturday, and Sunday because the first is on a Friday?  I don't know, but the first week could also be considered as the following week in the example because it is the first full week.  Due to these details, I think it's best to drop that column from the data set. 

```{r}
# Remove arrival_date_week_number
df <- df %>% select(-(arrival_date_week_number))
```

The next observation I had is that the 'adr' column stands for average daily rate.  I think it would be good to rename this column to make the data more readable. 

```{r}
# Rename adr column
df <- df %>% rename(average_daily_rate = adr)
```

Looking at the columns, it appears as though many of the columns represent categorical data.  These can all be converted to factors to make it easier to understand how many times each category appears in the data set.

```{r}
# Create factors for categorical data

month_levels <- c(
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December"
)
df$arrival_date_month <- factor(df$arrival_date_month, month_levels)

df$meal <- factor(df$meal, c("BB", "HB", "FB", "SC", "Undefined"))

df$hotel <- factor(df$hotel, c("City Hotel", "Resort Hotel"))

df$market_segment <- factor(df$market_segment, c("Aviation", "Complementary", "Corporate", "Direct", "Groups",
                                                 "Offline TA/TO", "Online TA"))

df$distribution_channel <- factor(df$distribution_channel, c("Corporate", "Direct", "GDS", "TA/TO", "Undefined"))

df$assigned_room_type <- factor(df$assigned_room_type, unique(df$assigned_room_type))

df$reserved_room_type <- factor(df$reserved_room_type, unique(df$reserved_room_type))

df$deposit_type <- factor(df$deposit_type, unique(df$deposit_type))

df$customer_type <- factor(df$customer_type, unique(df$customer_type))

df$reservation_status <- factor(df$reservation_status, unique(df$reservation_status))
```

The next observation is that we are recording the arrival date in three columns, instead of one.  We can use mutate to combine the information in these columns.  Afterwards, we can remove the old date columns to simplify the data set.

```{r}
# Convert arrival information into single column and remove old columns
df <- df %>% mutate(date = make_date(arrival_date_year, arrival_date_month, arrival_date_day_of_month))
df <- df %>% relocate(date, .after = hotel)
df <- df %>% select(-c(arrival_date_year, arrival_date_month, arrival_date_day_of_month))
```

Next, we can see that the number of nights that the booking was scheduled for is recorded in two columns.  One records the weeknights and the other records the weekdays.  I think it would be a good idea to create  a new column called total nights using mutate.  This is done below.

```{r}
# Create a column to store total number of nights stayed
df <- df %>% mutate(total_nights = stays_in_weekend_nights + stays_in_week_nights) %>%
      relocate(total_nights, .after = stays_in_week_nights)
```

Now that we have tidied up the tibble, we need to present it in a way that is readable.  I think the best way to do that is to sort it by arrival date. 

```{r}
# Sort by date
df <- df %>% arrange(date)
```

Any additional comments?