DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 4

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in data
    • Briefly describe the data
  • Tidy Data (as needed)
  • Identify variables that need to be mutated

Challenge 4

  • Show All Code
  • Hide All Code

  • View Source
challenge_4
abc_poll
eggs
fed_rates
hotel_bookings
debt
Author

Janhvi Joshi

Published

November 10, 2022

Code
library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. tidy data (as needed, including sanity checks)
  3. identify variables that need to be mutated
  4. mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • abc_poll.csv ⭐
  • poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
  • FedFundsRate.csv⭐⭐⭐
  • hotel_bookings.csv⭐⭐⭐⭐
  • debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
Code
hotels_booking <- read_csv("_data/hotel_bookings.csv")
hotels_booking
# A tibble: 119,390 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       0     342    2015 July         27       1       0       0      2
 2 Resor…       0     737    2015 July         27       1       0       0      2
 3 Resor…       0       7    2015 July         27       1       0       1      1
 4 Resor…       0      13    2015 July         27       1       0       1      1
 5 Resor…       0      14    2015 July         27       1       0       2      2
 6 Resor…       0      14    2015 July         27       1       0       2      2
 7 Resor…       0       0    2015 July         27       1       0       2      2
 8 Resor…       0       9    2015 July         27       1       0       2      2
 9 Resor…       1      85    2015 July         27       1       0       3      2
10 Resor…       1      75    2015 July         27       1       0       3      2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Briefly describe the data

This dataset summarises various details about a hotel booking and contains 120k records and 32 columns dating from year 2015 to 2017. There are two types of hotels in this dataset - Resort Hotel and City Hotel. Customers from around the world; around 160-170 countries, book these hotels. It can also be seen that on an average, around 37% of the bookings are cancelled and around 3% of the guests are repeated. On an average, customers may need to wait for 2.3 days in the waitlist to confirm a booking and around 57% of these bookings include some special requests. The hotels provide 4 different types of meals to their customers. This dataset is likely gathered from various online and offline channels - from where the booking was made.

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

I looked at multiple different columns and found a few that have many NULL values as shown below. So I will be removing rows where value of country is NULL since that record does not have much significance. The company columns also has many NULL values but since it is numerical, I will change it to NA. Additionally, I am not sure what the adr column depicts and will be removing it too since it is not useful for my analysis.

Code
table(hotels_booking$company)

    10    100    101    102    103    104    105    106    107    108    109 
     1      1      1      1     16      1      8      2      9     11      1 
    11    110    112    113    115    116    118     12    120    122    126 
     1     52     13     36      4      6      7     14     14     18      1 
   127    130    132    135    137    139     14    140    142    143    144 
    15     12      1     66      4      3      9      1      1     17     27 
   146    148    149    150    153    154    158    159     16    160    163 
     3     37      5     19    215    133      2      6      5      1     17 
   165    167    168    169    174    178    179     18    180    183    184 
     3      7      2     65    149     27     24      1      5     16      1 
   185    186    192    193    195    197     20    200    202    203    204 
     4     12      4     16     38     47     50      3     38     13     34 
   207    209    210    212    213    215    216    217    218    219     22 
     9     19      2      1      1      8     21      2     43    141      6 
   220    221    222    223    224    225    227    229    230    232    233 
     4     27      2    784      3      7     24      1      3      2    114 
   234    237    238    240    242    243    245    246    250    251    253 
     1      1     33      3     62      2      3      3      2     18      1 
   254    255    257    258    259    260    263    264    268    269    270 
    10      6      1      1      2      3     14      2     14     33     43 
   271    272    273    274    275    277    278    279     28    280    281 
     2      3      1     14      3      5      2      8      5     48    138 
   282    284    286    287    288    289     29    290    291    292    293 
     4      1     21      5      1      2      2     17     12     18      5 
   297    301    302    304    305    307    308    309     31    311    312 
     7      1      5      2      1     36     33      1     17      2      3 
   313    314    316    317    318    319     32    320    321    323    324 
     1      1      2      9      1      3      1      1      2     10      9 
   325    329    330    331    332    333    334    337    338     34    341 
     2     12      4     61      2     11      3     25     12      8      5 
   342    343    346    347    348    349     35    350    351    352    353 
    48     29     14      1     59      2      1      3      2      1      4 
   355    356    357    358    360    361    362    364    365    366    367 
    13     10      5      7     12      2      2      6     29     24     14 
   368    369     37    370    371    372    373    376    377    378    379 
     1      5     10      2     11      3      1      1      5      3      9 
    38    380    382    383    384    385    386    388     39    390    391 
    51     12      5      6      9     30      1      7      8     13      2 
   392    393    394    395    396    397    398    399     40    400    401 
     4      1      6      4     18     15      1     11    927      2      1 
   402    403    405    407    408    409    410    411    412    413    415 
     1      2    119     22     15     12      5      2      1      1      1 
   416    417    418    419     42    420    421    422    423    424    425 
     1      1     25      1      5      1      9      1      2     24      1 
   426    428    429     43    433    435    436    437    439    442    443 
     4     13      2     29      2     12      2      7      6      1      5 
   444    445    446    447    448     45    450    451    452    454    455 
     5      4      1      2      4    250     10      6      4      1      1 
   456    457    458    459     46    460    461    465    466     47    470 
     2      3      2      5     26      3      1     12      3     72      5 
   477    478    479     48    481    482    483    484    485    486    487 
    23      2      1      5      1      2      2      2     14      2      1 
   489     49    490    491    492    494    496    497    498    499    501 
     1      5      5      2      2      4      1      1     58      1      1 
   504    506    507     51    511    512    513    514    515    516    518 
    11      1     23     99      6      3      2      2      6      1      2 
    52    520    521    523    525    528     53    530    531    534    539 
     2      1      7     19     15      2      8      5      1      2      2 
    54    541    543     59      6     61     62     64     65     67     68 
     1      1      2      7      1      2     47      1      1    267     46 
    71     72     73     76     77     78      8     80     81     82     83 
     2     30      3      1      1     22      1      1     23     14      9 
    84     85     86     88      9     91     92     93     94     96     99 
     3      2     32     22     37     48     13      3     87      1     12 
  NULL 
112593 
Code
table(hotels_booking$country)

  ABW   AGO   AIA   ALB   AND   ARE   ARG   ARM   ASM   ATA   ATF   AUS   AUT 
    2   362     1    12     7    51   214     8     1     2     1   426  1263 
  AZE   BDI   BEL   BEN   BFA   BGD   BGR   BHR   BHS   BIH   BLR   BOL   BRA 
   17     1  2342     3     1    12    75     5     1    13    26    10  2224 
  BRB   BWA   CAF   CHE   CHL   CHN   CIV   CMR    CN   COL   COM   CPV   CRI 
    4     1     5  1730    65   999     6    10  1279    71     2    24    19 
  CUB   CYM   CYP   CZE   DEU   DJI   DMA   DNK   DOM   DZA   ECU   EGY   ESP 
    8     1    51   171  7287     1     1   435    14   103    27    32  8568 
  EST   ETH   FIN   FJI   FRA   FRO   GAB   GBR   GEO   GGY   GHA   GIB   GLP 
   83     3   447     1 10415     5     4 12129    22     3     4    18     2 
  GNB   GRC   GTM   GUY   HKG   HND   HRV   HUN   IDN   IMN   IND   IRL   IRN 
    9   128     4     1    29     1   100   230    35     2   152  3375    83 
  IRQ   ISL   ISR   ITA   JAM   JEY   JOR   JPN   KAZ   KEN   KHM   KIR   KNA 
   14    57   669  3766     6     8    21   197    19     6     2     1     2 
  KOR   KWT   LAO   LBN   LBY   LCA   LIE   LKA   LTU   LUX   LVA   MAC   MAR 
  133    16     2    31     8     1     3     7    81   287    55    16   259 
  MCO   MDG   MDV   MEX   MKD   MLI   MLT   MMR   MNE   MOZ   MRT   MUS   MWI 
    4     1    12    85    10     1    18     1     5    67     1     7     2 
  MYS   MYT   NAM   NCL   NGA   NIC   NLD   NOR   NPL  NULL   NZL   OMN   PAK 
   28     2     1     1    34     1  2104   607     1   488    74    18    14 
  PAN   PER   PHL   PLW   POL   PRI   PRT   PRY   PYF   QAT   ROU   RUS   RWA 
    9    29    40     1   919    12 48590     4     1    15   500   632     2 
  SAU   SDN   SEN   SGP   SLE   SLV   SMR   SRB   STP   SUR   SVK   SVN   SWE 
   48     1    11    39     1     2     1   101     2     5    65    57  1024 
  SYC   SYR   TGO   THA   TJK   TMP   TUN   TUR   TWN   TZA   UGA   UKR   UMI 
    2     3     2    59     9     3    39   248    51     5     2    68     1 
  URY   USA   UZB   VEN   VGB   VNM   ZAF   ZMB   ZWE 
   32  2097     4    26     1     8    80     2     4 
Code
hotels_booking <- hotels_booking %>% 
  filter(!(country == "NULL"))

Any additional comments?

Identify variables that need to be mutated

Now that null values from countries has been removed, we can see that the company column also has a lot of NULL values. In addition to that, I am going to combine the day, month and year of arrival of all guests. This will help in better analysis of the data. As mentioned above, I am also removing the adr column.

Document your work here.

Code
hotels_booking_mutate <- hotels_booking %>% 
  mutate(arrival_date = str_c(arrival_date_day_of_month,
                              arrival_date_month,
                              arrival_date_year, sep="/"),
         arrival_date = dmy(arrival_date)) %>% 
  select(-c(arrival_date_day_of_month,arrival_date_month,arrival_date_year, adr))

hotels_booking_mutate
# A tibble: 118,902 × 29
   hotel     is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal 
   <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>
 1 Resort H…       0     342      27       0       0      2       0      0 BB   
 2 Resort H…       0     737      27       0       0      2       0      0 BB   
 3 Resort H…       0       7      27       0       1      1       0      0 BB   
 4 Resort H…       0      13      27       0       1      1       0      0 BB   
 5 Resort H…       0      14      27       0       2      2       0      0 BB   
 6 Resort H…       0      14      27       0       2      2       0      0 BB   
 7 Resort H…       0       0      27       0       2      2       0      0 BB   
 8 Resort H…       0       9      27       0       2      2       0      0 FB   
 9 Resort H…       1      85      27       0       3      2       0      0 BB   
10 Resort H…       1      75      27       0       3      2       0      0 HB   
# … with 118,892 more rows, 19 more variables: country <chr>,
#   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, required_car_parking_spaces <dbl>,
#   total_of_special_requests <dbl>, reservation_status <chr>, …
Code
summary(hotels_booking_mutate$arrival_date)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2015-07-01" "2016-03-14" "2016-09-07" "2016-08-29" "2017-03-19" "2017-08-31" 

Any additional comments?

We can see that the above mutation can help hotels in further analysing their busiest days and prepare accordingly. Next, I will change the class of the company column from character to numeric.

Code
hotels_booking_mutate <- hotels_booking_mutate %>%
  mutate(across(c(company),~ replace(.,str_detect(., "NULL"), NA))) %>% mutate_at(vars(company),as.numeric)
is.numeric(hotels_booking_mutate$company)
[1] TRUE
Source Code
---
title: "Challenge 4"
author: "Janhvi Joshi"
desription: "More data wrangling: pivoting"
date: "11/10/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_4
  - abc_poll
  - eggs
  - fed_rates
  - hotel_bookings
  - debt
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to:

1)  read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2)  tidy data (as needed, including sanity checks)
3)  identify variables that need to be mutated
4)  mutate variables and sanity check all mutations

## Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

-   abc_poll.csv ⭐
-   poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
-   FedFundsRate.csv⭐⭐⭐
-   hotel_bookings.csv⭐⭐⭐⭐
-   debt_in_trillions.xlsx ⭐⭐⭐⭐⭐

```{r}
hotels_booking <- read_csv("_data/hotel_bookings.csv")
hotels_booking
```

### Briefly describe the data
This dataset summarises various details about a hotel booking and contains 120k records and 32 columns dating from year 2015 to 2017. There are two types of hotels in this dataset - Resort Hotel and City Hotel. Customers from around the world; around 160-170 countries, book these hotels. It can also be seen that on an average, around 37% of the bookings are cancelled and around 3% of the guests are repeated. On an average, customers may need to wait for 2.3 days in the waitlist to confirm a booking and around 57% of these bookings include some special requests. The hotels provide 4 different types of meals to their customers. This dataset is likely gathered from various online and offline channels - from where the booking was made.

## Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

I looked at multiple different columns and found a few that have many NULL values as shown below. So I will be removing rows where value of country is NULL since that record does not have much significance. The company columns also has many NULL values but since it is numerical, I will change it to NA. Additionally, I am not sure what the adr column depicts and will be removing it too since it is not useful for my analysis.

```{r}
table(hotels_booking$company)
table(hotels_booking$country)

```
```{r}
hotels_booking <- hotels_booking %>% 
  filter(!(country == "NULL"))
```

Any additional comments?

## Identify variables that need to be mutated

Now that null values from countries has been removed, we can see that the company column also has a lot of NULL values. In addition to that, I am going to combine the day, month and year of arrival of all guests. This will help in better analysis of the data. As mentioned above, I am also removing the adr column.

Document your work here.

```{r}
hotels_booking_mutate <- hotels_booking %>% 
  mutate(arrival_date = str_c(arrival_date_day_of_month,
                              arrival_date_month,
                              arrival_date_year, sep="/"),
         arrival_date = dmy(arrival_date)) %>% 
  select(-c(arrival_date_day_of_month,arrival_date_month,arrival_date_year, adr))

hotels_booking_mutate
```
```{r}
summary(hotels_booking_mutate$arrival_date)
```
Any additional comments?

We can see that the above mutation can help hotels in further analysing their busiest days and prepare accordingly. Next, I will change the class of the company column from character to numeric.

```{r}
hotels_booking_mutate <- hotels_booking_mutate %>%
  mutate(across(c(company),~ replace(.,str_detect(., "NULL"), NA))) %>% mutate_at(vars(company),as.numeric)
is.numeric(hotels_booking_mutate$company)
```