Code
library(tidyverse)
library(lubridate)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Janhvi Joshi
November 10, 2022
Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
# A tibble: 119,390 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
This dataset summarises various details about a hotel booking and contains 120k records and 32 columns dating from year 2015 to 2017. There are two types of hotels in this dataset - Resort Hotel and City Hotel. Customers from around the world; around 160-170 countries, book these hotels. It can also be seen that on an average, around 37% of the bookings are cancelled and around 3% of the guests are repeated. On an average, customers may need to wait for 2.3 days in the waitlist to confirm a booking and around 57% of these bookings include some special requests. The hotels provide 4 different types of meals to their customers. This dataset is likely gathered from various online and offline channels - from where the booking was made.
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
I looked at multiple different columns and found a few that have many NULL values as shown below. So I will be removing rows where value of country is NULL since that record does not have much significance. The company columns also has many NULL values but since it is numerical, I will change it to NA. Additionally, I am not sure what the adr column depicts and will be removing it too since it is not useful for my analysis.
10 100 101 102 103 104 105 106 107 108 109
1 1 1 1 16 1 8 2 9 11 1
11 110 112 113 115 116 118 12 120 122 126
1 52 13 36 4 6 7 14 14 18 1
127 130 132 135 137 139 14 140 142 143 144
15 12 1 66 4 3 9 1 1 17 27
146 148 149 150 153 154 158 159 16 160 163
3 37 5 19 215 133 2 6 5 1 17
165 167 168 169 174 178 179 18 180 183 184
3 7 2 65 149 27 24 1 5 16 1
185 186 192 193 195 197 20 200 202 203 204
4 12 4 16 38 47 50 3 38 13 34
207 209 210 212 213 215 216 217 218 219 22
9 19 2 1 1 8 21 2 43 141 6
220 221 222 223 224 225 227 229 230 232 233
4 27 2 784 3 7 24 1 3 2 114
234 237 238 240 242 243 245 246 250 251 253
1 1 33 3 62 2 3 3 2 18 1
254 255 257 258 259 260 263 264 268 269 270
10 6 1 1 2 3 14 2 14 33 43
271 272 273 274 275 277 278 279 28 280 281
2 3 1 14 3 5 2 8 5 48 138
282 284 286 287 288 289 29 290 291 292 293
4 1 21 5 1 2 2 17 12 18 5
297 301 302 304 305 307 308 309 31 311 312
7 1 5 2 1 36 33 1 17 2 3
313 314 316 317 318 319 32 320 321 323 324
1 1 2 9 1 3 1 1 2 10 9
325 329 330 331 332 333 334 337 338 34 341
2 12 4 61 2 11 3 25 12 8 5
342 343 346 347 348 349 35 350 351 352 353
48 29 14 1 59 2 1 3 2 1 4
355 356 357 358 360 361 362 364 365 366 367
13 10 5 7 12 2 2 6 29 24 14
368 369 37 370 371 372 373 376 377 378 379
1 5 10 2 11 3 1 1 5 3 9
38 380 382 383 384 385 386 388 39 390 391
51 12 5 6 9 30 1 7 8 13 2
392 393 394 395 396 397 398 399 40 400 401
4 1 6 4 18 15 1 11 927 2 1
402 403 405 407 408 409 410 411 412 413 415
1 2 119 22 15 12 5 2 1 1 1
416 417 418 419 42 420 421 422 423 424 425
1 1 25 1 5 1 9 1 2 24 1
426 428 429 43 433 435 436 437 439 442 443
4 13 2 29 2 12 2 7 6 1 5
444 445 446 447 448 45 450 451 452 454 455
5 4 1 2 4 250 10 6 4 1 1
456 457 458 459 46 460 461 465 466 47 470
2 3 2 5 26 3 1 12 3 72 5
477 478 479 48 481 482 483 484 485 486 487
23 2 1 5 1 2 2 2 14 2 1
489 49 490 491 492 494 496 497 498 499 501
1 5 5 2 2 4 1 1 58 1 1
504 506 507 51 511 512 513 514 515 516 518
11 1 23 99 6 3 2 2 6 1 2
52 520 521 523 525 528 53 530 531 534 539
2 1 7 19 15 2 8 5 1 2 2
54 541 543 59 6 61 62 64 65 67 68
1 1 2 7 1 2 47 1 1 267 46
71 72 73 76 77 78 8 80 81 82 83
2 30 3 1 1 22 1 1 23 14 9
84 85 86 88 9 91 92 93 94 96 99
3 2 32 22 37 48 13 3 87 1 12
NULL
112593
ABW AGO AIA ALB AND ARE ARG ARM ASM ATA ATF AUS AUT
2 362 1 12 7 51 214 8 1 2 1 426 1263
AZE BDI BEL BEN BFA BGD BGR BHR BHS BIH BLR BOL BRA
17 1 2342 3 1 12 75 5 1 13 26 10 2224
BRB BWA CAF CHE CHL CHN CIV CMR CN COL COM CPV CRI
4 1 5 1730 65 999 6 10 1279 71 2 24 19
CUB CYM CYP CZE DEU DJI DMA DNK DOM DZA ECU EGY ESP
8 1 51 171 7287 1 1 435 14 103 27 32 8568
EST ETH FIN FJI FRA FRO GAB GBR GEO GGY GHA GIB GLP
83 3 447 1 10415 5 4 12129 22 3 4 18 2
GNB GRC GTM GUY HKG HND HRV HUN IDN IMN IND IRL IRN
9 128 4 1 29 1 100 230 35 2 152 3375 83
IRQ ISL ISR ITA JAM JEY JOR JPN KAZ KEN KHM KIR KNA
14 57 669 3766 6 8 21 197 19 6 2 1 2
KOR KWT LAO LBN LBY LCA LIE LKA LTU LUX LVA MAC MAR
133 16 2 31 8 1 3 7 81 287 55 16 259
MCO MDG MDV MEX MKD MLI MLT MMR MNE MOZ MRT MUS MWI
4 1 12 85 10 1 18 1 5 67 1 7 2
MYS MYT NAM NCL NGA NIC NLD NOR NPL NULL NZL OMN PAK
28 2 1 1 34 1 2104 607 1 488 74 18 14
PAN PER PHL PLW POL PRI PRT PRY PYF QAT ROU RUS RWA
9 29 40 1 919 12 48590 4 1 15 500 632 2
SAU SDN SEN SGP SLE SLV SMR SRB STP SUR SVK SVN SWE
48 1 11 39 1 2 1 101 2 5 65 57 1024
SYC SYR TGO THA TJK TMP TUN TUR TWN TZA UGA UKR UMI
2 3 2 59 9 3 39 248 51 5 2 68 1
URY USA UZB VEN VGB VNM ZAF ZMB ZWE
32 2097 4 26 1 8 80 2 4
Any additional comments?
Now that null values from countries has been removed, we can see that the company column also has a lot of NULL values. In addition to that, I am going to combine the day, month and year of arrival of all guests. This will help in better analysis of the data. As mentioned above, I am also removing the adr column.
Document your work here.
# A tibble: 118,902 × 29
hotel is_ca…¹ lead_…² arriv…³ stays…⁴ stays…⁵ adults child…⁶ babies meal
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Resort H… 0 342 27 0 0 2 0 0 BB
2 Resort H… 0 737 27 0 0 2 0 0 BB
3 Resort H… 0 7 27 0 1 1 0 0 BB
4 Resort H… 0 13 27 0 1 1 0 0 BB
5 Resort H… 0 14 27 0 2 2 0 0 BB
6 Resort H… 0 14 27 0 2 2 0 0 BB
7 Resort H… 0 0 27 0 2 2 0 0 BB
8 Resort H… 0 9 27 0 2 2 0 0 FB
9 Resort H… 1 85 27 0 3 2 0 0 BB
10 Resort H… 1 75 27 0 3 2 0 0 HB
# … with 118,892 more rows, 19 more variables: country <chr>,
# market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, required_car_parking_spaces <dbl>,
# total_of_special_requests <dbl>, reservation_status <chr>, …
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2015-07-01" "2016-03-14" "2016-09-07" "2016-08-29" "2017-03-19" "2017-08-31"
Any additional comments?
We can see that the above mutation can help hotels in further analysing their busiest days and prepare accordingly. Next, I will change the class of the company column from character to numeric.
---
title: "Challenge 4"
author: "Janhvi Joshi"
desription: "More data wrangling: pivoting"
date: "11/10/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_4
- abc_poll
- eggs
- fed_rates
- hotel_bookings
- debt
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(lubridate)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to:
1) read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
2) tidy data (as needed, including sanity checks)
3) identify variables that need to be mutated
4) mutate variables and sanity check all mutations
## Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- abc_poll.csv ⭐
- poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
- FedFundsRate.csv⭐⭐⭐
- hotel_bookings.csv⭐⭐⭐⭐
- debt_in_trillions.xlsx ⭐⭐⭐⭐⭐
```{r}
hotels_booking <- read_csv("_data/hotel_bookings.csv")
hotels_booking
```
### Briefly describe the data
This dataset summarises various details about a hotel booking and contains 120k records and 32 columns dating from year 2015 to 2017. There are two types of hotels in this dataset - Resort Hotel and City Hotel. Customers from around the world; around 160-170 countries, book these hotels. It can also be seen that on an average, around 37% of the bookings are cancelled and around 3% of the guests are repeated. On an average, customers may need to wait for 2.3 days in the waitlist to confirm a booking and around 57% of these bookings include some special requests. The hotels provide 4 different types of meals to their customers. This dataset is likely gathered from various online and offline channels - from where the booking was made.
## Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
I looked at multiple different columns and found a few that have many NULL values as shown below. So I will be removing rows where value of country is NULL since that record does not have much significance. The company columns also has many NULL values but since it is numerical, I will change it to NA. Additionally, I am not sure what the adr column depicts and will be removing it too since it is not useful for my analysis.
```{r}
table(hotels_booking$company)
table(hotels_booking$country)
```
```{r}
hotels_booking <- hotels_booking %>%
filter(!(country == "NULL"))
```
Any additional comments?
## Identify variables that need to be mutated
Now that null values from countries has been removed, we can see that the company column also has a lot of NULL values. In addition to that, I am going to combine the day, month and year of arrival of all guests. This will help in better analysis of the data. As mentioned above, I am also removing the adr column.
Document your work here.
```{r}
hotels_booking_mutate <- hotels_booking %>%
mutate(arrival_date = str_c(arrival_date_day_of_month,
arrival_date_month,
arrival_date_year, sep="/"),
arrival_date = dmy(arrival_date)) %>%
select(-c(arrival_date_day_of_month,arrival_date_month,arrival_date_year, adr))
hotels_booking_mutate
```
```{r}
summary(hotels_booking_mutate$arrival_date)
```
Any additional comments?
We can see that the above mutation can help hotels in further analysing their busiest days and prepare accordingly. Next, I will change the class of the company column from character to numeric.
```{r}
hotels_booking_mutate <- hotels_booking_mutate %>%
mutate(across(c(company),~ replace(.,str_detect(., "NULL"), NA))) %>% mutate_at(vars(company),as.numeric)
is.numeric(hotels_booking_mutate$company)
```