Challenge 4 Instructions

challenge_4

abc_poll

eggs

fed_rates

hotel_bookings

debt

Author

Meredith Rolfe

Published

August 18, 2022

Code

library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
identify variables that need to be mutated
mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

abc_poll.csv ⭐
poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
FedFundsRate.csv⭐⭐⭐
hotel_bookings.csv⭐⭐⭐⭐
debt_in_trillions.xlsx ⭐⭐⭐⭐⭐

Hotels booking data- (hotel_bookings.csv)

Code

hotel <- read_csv("_data/hotel_bookings.csv", show_col_types = FALSE)
hotel

# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Briefly describe the data

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

Let’s use lubridate package, so we can check unique values in all the columns.

from the given result- 119,390 × 32, we can say dataset has 119390 rows and 31 columns so, each row in the data set belongs to the hotel bookings done by each customer and it describe- type, reservation status, arrival date, no. of the adult or children in the room, room type, customer, stay days, meal, previous cancellations.

Code

rapply(hotel,function(x)length(unique(x)))

                         hotel                    is_canceled 
                             2                              2 
                     lead_time              arrival_date_year 
                           479                              3 
            arrival_date_month       arrival_date_week_number 
                            12                             53 
     arrival_date_day_of_month        stays_in_weekend_nights 
                            31                             17 
          stays_in_week_nights                         adults 
                            35                             14 
                      children                         babies 
                             6                              5 
                          meal                        country 
                             5                            178 
                market_segment           distribution_channel 
                             8                              5 
             is_repeated_guest         previous_cancellations 
                             2                             15 
previous_bookings_not_canceled             reserved_room_type 
                            73                             10 
            assigned_room_type                booking_changes 
                            12                             21 
                  deposit_type                          agent 
                             3                            334 
                       company           days_in_waiting_list 
                           353                            128 
                 customer_type                            adr 
                             4                           8879 
   required_car_parking_spaces      total_of_special_requests 
                             5                              6 
            reservation_status        reservation_status_date 
                             3                            926

Any additional comments?

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

Code

table(hotel$country)


  ABW   AGO   AIA   ALB   AND   ARE   ARG   ARM   ASM   ATA   ATF   AUS   AUT 
    2   362     1    12     7    51   214     8     1     2     1   426  1263 
  AZE   BDI   BEL   BEN   BFA   BGD   BGR   BHR   BHS   BIH   BLR   BOL   BRA 
   17     1  2342     3     1    12    75     5     1    13    26    10  2224 
  BRB   BWA   CAF   CHE   CHL   CHN   CIV   CMR    CN   COL   COM   CPV   CRI 
    4     1     5  1730    65   999     6    10  1279    71     2    24    19 
  CUB   CYM   CYP   CZE   DEU   DJI   DMA   DNK   DOM   DZA   ECU   EGY   ESP 
    8     1    51   171  7287     1     1   435    14   103    27    32  8568 
  EST   ETH   FIN   FJI   FRA   FRO   GAB   GBR   GEO   GGY   GHA   GIB   GLP 
   83     3   447     1 10415     5     4 12129    22     3     4    18     2 
  GNB   GRC   GTM   GUY   HKG   HND   HRV   HUN   IDN   IMN   IND   IRL   IRN 
    9   128     4     1    29     1   100   230    35     2   152  3375    83 
  IRQ   ISL   ISR   ITA   JAM   JEY   JOR   JPN   KAZ   KEN   KHM   KIR   KNA 
   14    57   669  3766     6     8    21   197    19     6     2     1     2 
  KOR   KWT   LAO   LBN   LBY   LCA   LIE   LKA   LTU   LUX   LVA   MAC   MAR 
  133    16     2    31     8     1     3     7    81   287    55    16   259 
  MCO   MDG   MDV   MEX   MKD   MLI   MLT   MMR   MNE   MOZ   MRT   MUS   MWI 
    4     1    12    85    10     1    18     1     5    67     1     7     2 
  MYS   MYT   NAM   NCL   NGA   NIC   NLD   NOR   NPL  NULL   NZL   OMN   PAK 
   28     2     1     1    34     1  2104   607     1   488    74    18    14 
  PAN   PER   PHL   PLW   POL   PRI   PRT   PRY   PYF   QAT   ROU   RUS   RWA 
    9    29    40     1   919    12 48590     4     1    15   500   632     2 
  SAU   SDN   SEN   SGP   SLE   SLV   SMR   SRB   STP   SUR   SVK   SVN   SWE 
   48     1    11    39     1     2     1   101     2     5    65    57  1024 
  SYC   SYR   TGO   THA   TJK   TMP   TUN   TUR   TWN   TZA   UGA   UKR   UMI 
    2     3     2    59     9     3    39   248    51     5     2    68     1 
  URY   USA   UZB   VEN   VGB   VNM   ZAF   ZMB   ZWE 
   32  2097     4    26     1     8    80     2     4

We can see lot of NULL value in the country column, so we should remove this because it won’t help us.

Code

hotel <- hotel %>% 
  filter(!(country == "NULL"))

To see the class for each columns let’s do,

Code

sapply(hotel, class)

                         hotel                    is_canceled 
                   "character"                      "numeric" 
                     lead_time              arrival_date_year 
                     "numeric"                      "numeric" 
            arrival_date_month       arrival_date_week_number 
                   "character"                      "numeric" 
     arrival_date_day_of_month        stays_in_weekend_nights 
                     "numeric"                      "numeric" 
          stays_in_week_nights                         adults 
                     "numeric"                      "numeric" 
                      children                         babies 
                     "numeric"                      "numeric" 
                          meal                        country 
                   "character"                    "character" 
                market_segment           distribution_channel 
                   "character"                    "character" 
             is_repeated_guest         previous_cancellations 
                     "numeric"                      "numeric" 
previous_bookings_not_canceled             reserved_room_type 
                     "numeric"                    "character" 
            assigned_room_type                booking_changes 
                   "character"                      "numeric" 
                  deposit_type                          agent 
                   "character"                    "character" 
                       company           days_in_waiting_list 
                   "character"                      "numeric" 
                 customer_type                            adr 
                   "character"                      "numeric" 
   required_car_parking_spaces      total_of_special_requests 
                     "numeric"                      "numeric" 
            reservation_status        reservation_status_date 
                   "character"                         "Date"

Now we can mutate the arrival date into one single field and after that we can also mutate the adults and children so we can get the total guests in the hotel and then we will see the range of the arrival date of the bookings.

Code

hotel_mutate <- hotel %>% 
  mutate(arrival_date = str_c(arrival_date_day_of_month,
                              arrival_date_month,
                              arrival_date_year, sep="/"),
         arrival_date = dmy(arrival_date),
         total_guests = adults + children + babies) %>% 
  select(-c(arrival_date_day_of_month,arrival_date_month,arrival_date_year))
hotel_mutate

# A tibble: 118,902 × 31
   hotel     is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal 
   <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>
 1 Resort H…       0     342      27       0       0      2       0      0 BB   
 2 Resort H…       0     737      27       0       0      2       0      0 BB   
 3 Resort H…       0       7      27       0       1      1       0      0 BB   
 4 Resort H…       0      13      27       0       1      1       0      0 BB   
 5 Resort H…       0      14      27       0       2      2       0      0 BB   
 6 Resort H…       0      14      27       0       2      2       0      0 BB   
 7 Resort H…       0       0      27       0       2      2       0      0 BB   
 8 Resort H…       0       9      27       0       2      2       0      0 FB   
 9 Resort H…       1      85      27       0       3      2       0      0 BB   
10 Resort H…       1      75      27       0       3      2       0      0 HB   
# … with 118,892 more rows, 21 more variables: country <chr>,
#   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …

Code

summary(hotel_mutate$arrival_date)

        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2015-07-01" "2016-03-14" "2016-09-07" "2016-08-29" "2017-03-19" "2017-08-31"

From the above information we can see that arrival dates now lie between 2015-07-01 and 2017-08-31.

Code

hotel_mutate <- hotel_mutate %>%
  mutate(booking_date = arrival_date - lead_time)
hotel_mutate

# A tibble: 118,902 × 32
   hotel     is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal 
   <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>
 1 Resort H…       0     342      27       0       0      2       0      0 BB   
 2 Resort H…       0     737      27       0       0      2       0      0 BB   
 3 Resort H…       0       7      27       0       1      1       0      0 BB   
 4 Resort H…       0      13      27       0       1      1       0      0 BB   
 5 Resort H…       0      14      27       0       2      2       0      0 BB   
 6 Resort H…       0      14      27       0       2      2       0      0 BB   
 7 Resort H…       0       0      27       0       2      2       0      0 BB   
 8 Resort H…       0       9      27       0       2      2       0      0 FB   
 9 Resort H…       1      85      27       0       3      2       0      0 BB   
10 Resort H…       1      75      27       0       3      2       0      0 HB   
# … with 118,892 more rows, 22 more variables: country <chr>,
#   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …

Now it adds a new column to the hotel_mutate data frame called booking_date, which is calculated by subtracting the lead_time column from the arrival_date column. This new column represents the date on which the hotel reservation was booked. After adding the new booking_date column, it returns the updated hotel_mutate data frame.

Code

unique(hotels_booking$reservation_status)

Error in unique(hotels_booking$reservation_status): object 'hotels_booking' not found

Now we will find that when the status was changed to cancelled after booking and also after that we will Mutate the class of the agent and company field to numeric.

Code

hotel_cancellations <- filter(hotel_mutate, reservation_status == 'Canceled')
hotel_cancellations <- mutate(hotel_cancellations, canc_time = booking_date - reservation_status_date)

summarise(hotel_cancellations, min = min(canc_time), max = max(canc_time), mean = mean(canc_time), median = median(canc_time))

# A tibble: 1 × 4
  min       max    mean          median  
  <drtn>    <drtn> <drtn>        <drtn>  
1 -584 days 0 days -58.9916 days -26 days

Here, we filters the hotel_mutate data frame to only include rows where the reservation_status column is equal to ‘Canceled’. Then, It adds a new column called canc_time to the hotel_cancellations data frame, which is calculated by subtracting the reservation_status_date column from the booking_date column and calculates summary statistics for the canc_time column, including the minimum, maximum, mean, and median values.

Code

hotel_mutate <- hotel_mutate %>%
  mutate(across(c(agent, company), ~ ifelse(str_detect(., "NULL"), NA, .))) %>%
  mutate_at(vars(agent, company), as.numeric)

is.numeric(hotel_mutate$agent)

[1] TRUE

First, the across function is used to replace all occurrences of the string "NULL" with NA in the agent and company columns.

Then, the mutate_at function is used to convert the agent and company columns to numeric data type.

Finally, the is.numeric function is used to check if the agent column has been successfully converted to numeric data type.

Now, we will check did it got change to numeric or not.

Code

is.numeric(hotel_mutate$company)

[1] TRUE

Debt in trillions data- (debt_in_trillions.xlsx)

Code

library(stringr)
library(tidyr)
library(dplyr)
library(ggplot2)

Let’s try a new data set which is debt in trillion

Code

library(readxl)
data <- read_excel("_data/debt_in_trillions.xlsx")
head(data)

# A tibble: 6 × 8
  `Year and Quarter` Mortgage `HE Revolving` Auto …¹ Credi…² Stude…³ Other Total
  <chr>                 <dbl>          <dbl>   <dbl>   <dbl>   <dbl> <dbl> <dbl>
1 03:Q1                  4.94          0.242   0.641   0.688   0.241 0.478  7.23
2 03:Q2                  5.08          0.26    0.622   0.693   0.243 0.486  7.38
3 03:Q3                  5.18          0.269   0.684   0.693   0.249 0.477  7.56
4 03:Q4                  5.66          0.302   0.704   0.698   0.253 0.449  8.07
5 04:Q1                  5.84          0.328   0.72    0.695   0.260 0.446  8.29
6 04:Q2                  5.97          0.367   0.743   0.697   0.263 0.423  8.46
# … with abbreviated variable names ¹`Auto Loan`, ²`Credit Card`,
#   ³`Student Loan`

Data shows the amount of debt in mortgage, HE revolving, auto loan, credit card, student loan, other, and total for each quarter from 2003 - 2021.

Code

str(data)

tibble [74 × 8] (S3: tbl_df/tbl/data.frame)
 $ Year and Quarter: chr [1:74] "03:Q1" "03:Q2" "03:Q3" "03:Q4" ...
 $ Mortgage        : num [1:74] 4.94 5.08 5.18 5.66 5.84 ...
 $ HE Revolving    : num [1:74] 0.242 0.26 0.269 0.302 0.328 0.367 0.426 0.468 0.502 0.528 ...
 $ Auto Loan       : num [1:74] 0.641 0.622 0.684 0.704 0.72 0.743 0.751 0.728 0.725 0.774 ...
 $ Credit Card     : num [1:74] 0.688 0.693 0.693 0.698 0.695 0.697 0.706 0.717 0.71 0.717 ...
 $ Student Loan    : num [1:74] 0.241 0.243 0.249 0.253 0.26 ...
 $ Other           : num [1:74] 0.478 0.486 0.477 0.449 0.447 ...
 $ Total           : num [1:74] 7.23 7.38 7.56 8.07 8.29 ...

Now we will try to create a column with only year and quarter, basically we will separate it.

Code

newdata <- data %>%
  separate(`Year and Quarter`, c("year", "quarter"))

newdata$year <- as.integer(newdata$year)

Now we will create a data which is calculated by year

Code

yeardata <- newdata %>%
  group_by(year)  %>%
  summarise_at (vars(Mortgage:Other), list(mean)) 
  
head(yeardata)

# A tibble: 6 × 7
   year Mortgage `HE Revolving` `Auto Loan` `Credit Card` `Student Loan` Other
  <int>    <dbl>          <dbl>       <dbl>         <dbl>          <dbl> <dbl>
1     3     5.22          0.268       0.663         0.693          0.246 0.472
2     4     6.09          0.397       0.736         0.704          0.300 0.426
3     5     6.80          0.534       0.780         0.724          0.377 0.404
4     6     7.87          0.595       0.806         0.746          0.450 0.422
5     7     8.79          0.626       0.808         0.804          0.524 0.412
6     8     9.26          0.685       0.804         0.853          0.604 0.410

Code

yeardata <- newdata %>%
  group_by(year)  %>%
  summarise(mean_mortgage = mean(Mortgage), mean_other = mean(Other))

head(yeardata)

# A tibble: 6 × 3
   year mean_mortgage mean_other
  <int>         <dbl>      <dbl>
1     3          5.22      0.472
2     4          6.09      0.426
3     5          6.80      0.404
4     6          7.87      0.422
5     7          8.79      0.412
6     8          9.26      0.410

This approach allows for more explicit column names in the resulting data frame.