library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Yoshita Varma Annam
December 21, 2022
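Today’s challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.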
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
# A tibble: 119,390 × 32
hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Resor… 0 342 2015 July 27 1 0 0 2
2 Resor… 0 737 2015 July 27 1 0 0 2
3 Resor… 0 7 2015 July 27 1 0 1 1
4 Resor… 0 13 2015 July 27 1 0 1 1
5 Resor… 0 14 2015 July 27 1 0 2 2
6 Resor… 0 14 2015 July 27 1 0 2 2
7 Resor… 0 0 2015 July 27 1 0 2 2
8 Resor… 0 9 2015 July 27 1 0 2 2
9 Resor… 1 85 2015 July 27 1 0 3 2
10 Resor… 1 75 2015 July 27 1 0 3 2
# … with 119,380 more rows, 22 more variables: children <dbl>, babies <dbl>,
# meal <chr>, country <chr>, market_segment <chr>,
# distribution_channel <chr>, is_repeated_guest <dbl>,
# previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
# deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
# customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
A first look shows that the data set contains 119,390 hotel bookings, one row per booking, each described by 32 variables. The variables mostly describe the booking itself: the arrival date, the lead time, the length of stay, and whether the booking was cancelled. They also record the number of adults, children, and babies on each booking, along with the guest's country. A separate field, is_repeated_guest, flags repeat guests. Understanding the data further requires more operations.
hotel is_canceled lead_time arrival_date_year
Length:119390 Min. :0.0000 Min. : 0 Min. :2015
Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
Mode :character Median :0.0000 Median : 69 Median :2016
Mean :0.3704 Mean :104 Mean :2016
3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
Max. :1.0000 Max. :737 Max. :2017
arrival_date_month arrival_date_week_number arrival_date_day_of_month
Length:119390 Min. : 1.00 Min. : 1.0
Class :character 1st Qu.:16.00 1st Qu.: 8.0
Mode :character Median :28.00 Median :16.0
Mean :27.17 Mean :15.8
3rd Qu.:38.00 3rd Qu.:23.0
Max. :53.00 Max. :31.0
stays_in_weekend_nights stays_in_week_nights adults
Min. : 0.0000 Min. : 0.0 Min. : 0.000
1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
Median : 1.0000 Median : 2.0 Median : 2.000
Mean : 0.9276 Mean : 2.5 Mean : 1.856
3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
Max. :19.0000 Max. :50.0 Max. :55.000
children babies meal country
Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
Median : 0.0000 Median : 0.000000 Mode :character Mode :character
Mean : 0.1039 Mean : 0.007949
3rd Qu.: 0.0000 3rd Qu.: 0.000000
Max. :10.0000 Max. :10.000000
NA's :4
market_segment distribution_channel is_repeated_guest
Length:119390 Length:119390 Min. :0.00000
Class :character Class :character 1st Qu.:0.00000
Mode :character Mode :character Median :0.00000
Mean :0.03191
3rd Qu.:0.00000
Max. :1.00000
previous_cancellations previous_bookings_not_canceled reserved_room_type
Min. : 0.00000 Min. : 0.0000 Length:119390
1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
Median : 0.00000 Median : 0.0000 Mode :character
Mean : 0.08712 Mean : 0.1371
3rd Qu.: 0.00000 3rd Qu.: 0.0000
Max. :26.00000 Max. :72.0000
assigned_room_type booking_changes deposit_type agent
Length:119390 Min. : 0.0000 Length:119390 Length:119390
Class :character 1st Qu.: 0.0000 Class :character Class :character
Mode :character Median : 0.0000 Mode :character Mode :character
Mean : 0.2211
3rd Qu.: 0.0000
Max. :21.0000
company days_in_waiting_list customer_type adr
Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
Mode :character Median : 0.000 Mode :character Median : 94.58
Mean : 2.321 Mean : 101.83
3rd Qu.: 0.000 3rd Qu.: 126.00
Max. :391.000 Max. :5400.00
required_car_parking_spaces total_of_special_requests reservation_status
Min. :0.00000 Min. :0.0000 Length:119390
1st Qu.:0.00000 1st Qu.:0.0000 Class :character
Median :0.00000 Median :0.0000 Mode :character
Mean :0.06252 Mean :0.5714
3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :8.00000 Max. :5.0000
reservation_status_date
Min. :2014-10-17
1st Qu.:2016-02-01
Median :2016-08-07
Mean :2016-07-30
3rd Qu.:2017-02-08
Max. :2017-09-14
[1] "hotel" "is_canceled"
[3] "lead_time" "arrival_date_year"
[5] "arrival_date_month" "arrival_date_week_number"
[7] "arrival_date_day_of_month" "stays_in_weekend_nights"
[9] "stays_in_week_nights" "adults"
[11] "children" "babies"
[13] "meal" "country"
[15] "market_segment" "distribution_channel"
[17] "is_repeated_guest" "previous_cancellations"
[19] "previous_bookings_not_canceled" "reserved_room_type"
[21] "assigned_room_type" "booking_changes"
[23] "deposit_type" "agent"
[25] "company" "days_in_waiting_list"
[27] "customer_type" "adr"
[29] "required_car_parking_spaces" "total_of_special_requests"
[31] "reservation_status" "reservation_status_date"
[1] "No Deposit" "Refundable" "Non Refund"
[1] 3
[1] "Direct" "Corporate" "Online TA" "Offline TA/TO"
[5] "Complementary" "Groups" "Undefined" "Aviation"
[1] 8
[1] "Direct" "Corporate" "TA/TO" "Undefined" "GDS"
[1] 5
[1] "Resort Hotel" "City Hotel"
[1] 2
[1] "PRT" "GBR" "USA" "ESP" "IRL" "FRA" "NULL" "ROU" "NOR" "OMN"
[11] "ARG" "POL" "DEU" "BEL" "CHE" "CN" "GRC" "ITA" "NLD" "DNK"
[21] "RUS" "SWE" "AUS" "EST" "CZE" "BRA" "FIN" "MOZ" "BWA" "LUX"
[31] "SVN" "ALB" "IND" "CHN" "MEX" "MAR" "UKR" "SMR" "LVA" "PRI"
[41] "SRB" "CHL" "AUT" "BLR" "LTU" "TUR" "ZAF" "AGO" "ISR" "CYM"
[51] "ZMB" "CPV" "ZWE" "DZA" "KOR" "CRI" "HUN" "ARE" "TUN" "JAM"
[61] "HRV" "HKG" "IRN" "GEO" "AND" "GIB" "URY" "JEY" "CAF" "CYP"
[71] "COL" "GGY" "KWT" "NGA" "MDV" "VEN" "SVK" "FJI" "KAZ" "PAK"
[81] "IDN" "LBN" "PHL" "SEN" "SYC" "AZE" "BHR" "NZL" "THA" "DOM"
[91] "MKD" "MYS" "ARM" "JPN" "LKA" "CUB" "CMR" "BIH" "MUS" "COM"
[101] "SUR" "UGA" "BGR" "CIV" "JOR" "SYR" "SGP" "BDI" "SAU" "VNM"
[111] "PLW" "QAT" "EGY" "PER" "MLT" "MWI" "ECU" "MDG" "ISL" "UZB"
[121] "NPL" "BHS" "MAC" "TGO" "TWN" "DJI" "STP" "KNA" "ETH" "IRQ"
[131] "HND" "RWA" "KHM" "MCO" "BGD" "IMN" "TJK" "NIC" "BEN" "VGB"
[141] "TZA" "GAB" "GHA" "TMP" "GLP" "KEN" "LIE" "GNB" "MNE" "UMI"
[151] "MYT" "FRO" "MMR" "PAN" "BFA" "LBY" "MLI" "NAM" "BOL" "PRY"
[161] "BRB" "ABW" "AIA" "SLV" "DMA" "PYF" "GUY" "LCA" "ATA" "GTM"
[171] "ASM" "MRT" "NCL" "KIR" "SDN" "ATF" "SLE" "LAO"
[1] 178
The output above shows that guests come from all over the world: the country column has 178 distinct values (one of which is the literal string “NULL”, apparently a placeholder for a missing country), and arrivals run from 2015 to 2017. The data covers just two kinds of hotel, “Resort Hotel” and “City Hotel”. There are 8 market segments, ranging from professional to personal bookings (for example, Corporate and Aviation). Reading the means from the summary, the average booking has about 1.86 adults, 0.10 children, and 0.008 babies. Similarly, the average booking covers about 2.5 week nights and 0.93 weekend nights. These figures come only from the summaries; more analysis is needed to draw firmer conclusions.
Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
# A tibble: 8 × 4
market_segment Adults_count Children_count Babies_count
<chr> <dbl> <dbl> <dbl>
1 Aviation 238 0 0
2 Complementary 1104 61 22
3 Corporate 6545 53 16
4 Direct 23645 2248 289
5 Groups 35598 58 13
6 Offline TA/TO 44060 653 178
7 Online TA 110441 9330 431
8 Undefined 5 0 0
# A tibble: 8 × 3
market_segment Week_stays_count Weekend_stays_count
<chr> <dbl> <dbl>
1 Aviation 2.51 1.09
2 Complementary 1.30 0.351
3 Corporate 1.65 0.438
4 Direct 2.35 0.856
5 Groups 2.20 0.789
6 Offline TA/TO 2.85 1.05
7 Online TA 2.58 0.991
8 Undefined 1 0.5
I chose market segment to understand what kinds of bookings are made. For the professional segments (Corporate, Aviation, and Groups), the guests are almost entirely adults, with very few children or babies. Bookings made through Online TA, Offline TA/TO, or Direct include far more children and babies, which suggests that these bookings are more often for personal vacations or stays. Hotels can also plan around the typical length of stay in each segment. For example, Offline TA/TO bookings have the longest average week-night stay (2.85 nights) and the second-longest average weekend stay (1.05 nights, just behind Aviation at 1.09). Hotels could therefore give more weight to bookings from this channel because they bring longer stays. Whether these bookings are also less likely to be cancelled is not shown by the tables above and would need to be checked against is_canceled.
---
title: "Challenge 2"
author: "Yoshita Varma Annam"
description: "Data wrangling: using group_by() and summarise()"
date: "12/21/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xlsx ⭐
- FAOstat\*.csv ⭐⭐⭐
- hotel_bookings ⭐⭐⭐⭐
```{r}
library(readr)
HotelBookings_csv <- read_csv("_data/hotel_bookings.csv")
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
```{r}
#| label: understanding the data
HotelBookings_csv
```
A first look shows that the data set contains 119,390 hotel bookings, one row per booking, each described by 32 variables. The variables mostly describe the booking itself: the arrival date, the lead time, the length of stay, and whether the booking was cancelled. They also record the number of adults, children, and babies on each booking, along with the guest's country. A separate field, `is_repeated_guest`, flags repeat guests. Understanding the data further requires more operations.
```{r}
#| label: summary of the data
summary(HotelBookings_csv)
```
```{r}
#| label: column names the data
colnames(HotelBookings_csv)
```
```{r}
#| label: finding unique values of the data
unique(HotelBookings_csv$deposit_type)
length(unique(HotelBookings_csv$deposit_type))
unique(HotelBookings_csv$market_segment)
length(unique(HotelBookings_csv$market_segment))
unique(HotelBookings_csv$distribution_channel)
length(unique(HotelBookings_csv$distribution_channel))
unique(HotelBookings_csv$hotel)
length(unique(HotelBookings_csv$hotel))
unique(HotelBookings_csv$country)
length(unique(HotelBookings_csv$country))
```
The output above shows that guests come from all over the world: the `country` column has 178 distinct values (one of which is the literal string "NULL", apparently a placeholder for a missing country), and arrivals run from 2015 to 2017. The data covers just two kinds of hotel, "Resort Hotel" and "City Hotel". There are 8 market segments, ranging from professional to personal bookings (for example, Corporate and Aviation). Reading the means from the summary, the average booking has about 1.86 adults, 0.10 children, and 0.008 babies. Similarly, the average booking covers about 2.5 week nights and 0.93 weekend nights. These figures come only from the summaries; more analysis is needed to draw firmer conclusions.
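The `unique(HotelBookings_csv$country)` output includes the string "NULL", and `agent` and `company` are read as character columns, which suggests that missing values are stored as the text "NULL". A minimal sketch of one way to handle this at import time is shown below. It uses a small made-up CSV written to a temporary file (the rows are hypothetical, not taken from the real data) and the `na` argument of `read_csv()`.

```r
library(readr)
library(dplyr)

# Hypothetical three-row extract written to a temporary file for illustration
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("hotel,country,agent",
    "Resort Hotel,PRT,NULL",
    "City Hotel,NULL,9",
    "City Hotel,GBR,14"),
  csv_path
)

# Treat the literal string "NULL" as missing, in addition to "" and "NA"
bookings_demo <- read_csv(
  csv_path,
  na = c("", "NA", "NULL"),
  show_col_types = FALSE
)

# Count missing countries and distinct real country codes
missing_country <- sum(is.na(bookings_demo$country))
distinct_countries <- n_distinct(bookings_demo$country, na.rm = TRUE)

# With "NULL" parsed as NA, agent can be guessed as a numeric column
agent_is_numeric <- is.numeric(bookings_demo$agent)
```

Applied to the real file, the same `na` argument would presumably make the distinct-country count exclude the placeholder, but that has not been run here.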
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
```{r}
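# Group the bookings by market segment for the summaries below
Hotel_Bookings_Market <- HotelBookings_csv %>%
  group_by(market_segment)
# view() only opens a data viewer in interactive sessions
view(Hotel_Bookings_Market)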
```
```{r}
Hotel_Bookings_Market %>%
summarise(Adults_count = sum(adults,na.rm = TRUE),
Children_count = sum(children,na.rm = TRUE),
Babies_count = sum(babies, na.rm = TRUE))
```
```{r}
Hotel_Bookings_Market %>%
  summarise(Week_stays_count = mean(stays_in_week_nights, na.rm = TRUE),
            Weekend_stays_count = mean(stays_in_weekend_nights, na.rm = TRUE))
```
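The chunks above report sums and means only. The sketch below shows how the remaining measures named in the task (median, standard deviation, min, max, and a quantile) could be added in a single `summarise()` call. It runs on a small hypothetical tibble, `toy_stays`, rather than the real bookings, so the numbers are illustrative only.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; the real data set has 119,390 rows
toy_stays <- tribble(
  ~market_segment, ~stays_in_week_nights,
  "Direct",        1,
  "Direct",        2,
  "Direct",        3,
  "Groups",        2,
  "Groups",        2,
  "Groups",        5
)

# Central tendency and dispersion of week-night stays per market segment
week_night_stats <- toy_stays %>%
  group_by(market_segment) %>%
  summarise(
    n             = n(),
    mean_nights   = mean(stays_in_week_nights, na.rm = TRUE),
    median_nights = median(stays_in_week_nights, na.rm = TRUE),
    sd_nights     = sd(stays_in_week_nights, na.rm = TRUE),
    min_nights    = min(stays_in_week_nights, na.rm = TRUE),
    q75_nights    = quantile(stays_in_week_nights, 0.75,
                             na.rm = TRUE, names = FALSE),
    max_nights    = max(stays_in_week_nights, na.rm = TRUE)
  )

week_night_stats
```

Swapping `toy_stays` for `HotelBookings_csv` should give the same table shape for the real data, with one row per market segment.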
### Explain and Interpret
I chose market segment to understand what kinds of bookings are made. For the professional segments (Corporate, Aviation, and Groups), the guests are almost entirely adults, with very few children or babies. Bookings made through Online TA, Offline TA/TO, or Direct include far more children and babies, which suggests that these bookings are more often for personal vacations or stays. Hotels can also plan around the typical length of stay in each segment. For example, Offline TA/TO bookings have the longest average week-night stay (2.85 nights) and the second-longest average weekend stay (1.05 nights, just behind Aviation at 1.09). Hotels could therefore give more weight to bookings from this channel because they bring longer stays. Whether these bookings are also less likely to be cancelled is not shown by the tables above and would need to be checked against `is_canceled`.
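The paragraph above links market segment to the chance of a booking being cancelled. One way to check that link is sketched below: because `is_canceled` is coded 0/1, its mean within each group is the cancellation rate. The tibble `toy_cancellations` is hypothetical and its values are made up, so the rates say nothing about the real segments.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; values are invented for illustration
toy_cancellations <- tribble(
  ~market_segment, ~is_canceled,
  "Offline TA/TO", 0,
  "Offline TA/TO", 1,
  "Offline TA/TO", 0,
  "Offline TA/TO", 0,
  "Online TA",     1,
  "Online TA",     1,
  "Online TA",     0,
  "Online TA",     0
)

# Share of cancelled bookings per segment, lowest rate first
cancel_by_segment <- toy_cancellations %>%
  group_by(market_segment) %>%
  summarise(
    bookings    = n(),
    cancel_rate = mean(is_canceled, na.rm = TRUE)
  ) %>%
  arrange(cancel_rate)

cancel_by_segment
```

Running the same pipeline on `HotelBookings_csv` would show whether the segments with the longest stays really do cancel less often.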