challenge_2
Data wrangling: using group() and summarise()
Author

Cristhian Barba Garzon

Published

December 27, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xlsx ⭐
  • FAOstat*.csv ⭐⭐⭐
  • hotel_bookings ⭐⭐⭐⭐
Code
h_bookings = read_csv("_data/hotel_bookings.csv") #sets the read_csv() function to a named variable
view(h_bookings) #opens the data set to be viewed by an audience 

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Describing the data

The commands below are great in helping to describe the data chosen: hotel bookings. The dimensions command, dim(), tells us that there are 119,390 rows and 32 columns–meaning there are 119,390 observations recorded. The column names command, colnames(), tells us the names of these 32 columns, and this is useful when trying to look through your data set to select specific columns to analyze. Furtherdown, the use of piping helps in creating clean command lines to select specific columns and identify distinct values in those columns. By selecting the “hotel”, “country”, and “arrival_date_year” columns, I am able to identify unique values within those columns. There are only two types of hotels chosen for this data: resort and city. Additionally, the data was taken from 10 different countries and from 3 different years. This information is important because it tells use where the data is from and what time period it is from.

Code
h_bookings %>%
  dim() #tells use the dimensions of the dataset; the dimensions can tell us how many observed values (rows) there are
[1] 119390     32
Code
colnames(h_bookings) #tells us the name of each column--there are 32 columns, so this is useful
 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"       
Code
h_bookings %>%
  select("hotel") %>%
  distinct() #tells us the distinct values in the "hotel" column; there are only two different values: resort and city hotels 
# A tibble: 2 × 1
  hotel       
  <chr>       
1 Resort Hotel
2 City Hotel  
Code
h_bookings %>%
  select("country")%>%
  distinct() #allows use to select the country column and find distinct values; this simply tells us the different countries that data was taken from for hotels. There are about 10 different countires where data was gathered from.
# A tibble: 178 × 1
   country
   <chr>  
 1 PRT    
 2 GBR    
 3 USA    
 4 ESP    
 5 IRL    
 6 FRA    
 7 NULL   
 8 ROU    
 9 NOR    
10 OMN    
# … with 168 more rows
Code
h_bookings %>% 
  select('arrival_date_year') %>%
  distinct() #this section tells us the distinct years that this data was taken from. This data was taken from the years 2015,2016, and 2017. 
# A tibble: 3 × 1
  arrival_date_year
              <dbl>
1              2015
2              2016
3              2017

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Explaining and Interpreting Grouped Summary Statistics

The commands below are great in helping find the mean, median, and mode in different columns in the hotel bookings data set. The summarize command, summarize(), helps in summarizing information within the column. Specifically, the mean and median commands are used within the summarize command to find the values for the “stays_in_week_nights” column; this simply allows us to know the average and median amount of nights guests stayed in their respective hotels. Using the na.rm = TRUE command allows me to completely eliminate any unknown values in the column. Since there is no mode command in R, a function was used to find it manually. The mode for the amount of week nights stayed is found.

Furthermore, the summarize command is continously helpful with the additional maximum, minimum, and standard deviation commands. I chose to find the maximum amount of adults, using the “adults” column, the maximum and minimum amount of babies, using the “babies” column, and the maximum amount of weekend nights stayed. This is important information to know about this data set because it tells us how the guests behave; considering the maximum amount of weekend nights stayed, one can infer that this particular booking was over a long period of time. Knowing the amount of adults tells us that people can travel in very large groups to either of these hotels. You can also analyze the difference in the maximum amount of adults and babies; the difference between the two is very large–by 45. This difference can tell us that people with children may travel less, but further analysis may be required to make any assumptions. This is all important information that any viewer can use to understand the data and see where it could come from or how different values may relate to one another.

Lastly, using the filter command is important when finding specific values. Below, the filter command helps to find rows that only contain “September” in the “arrival_date_month” column; this can be used to analyze how many bookings were made for that specific month.

Code
summarize(h_bookings, mean_stay = mean(`stays_in_week_nights`, na.rm=TRUE)) #this allows us to see what the average amount of nights families stay in their respective hotels. 
# A tibble: 1 × 1
  mean_stay
      <dbl>
1      2.50
Code
summarize(h_bookings, median_stay = median(`stays_in_week_nights`,na.rm=TRUE)) #gives us the median for the amount of nights hotel guests stayed.
# A tibble: 1 × 1
  median_stay
        <dbl>
1           2
Code
mode = function(x){
  index = which.max(table(x)) # which.max() returns the index of the max value in a vector
  mode = x[index] # use the mode index to find the mode in the original vector
  return(mode)
}
mode(h_bookings$stays_in_week_nights) #this finds the mode for the amount of nights people have stayed in their hotels. 
[1] 1
Code
summarize(h_bookings, Maximum_nights_stayed_weekend = max(`stays_in_weekend_nights`, na.rm = TRUE), maximum_adults = max(`adults`, na.rm = TRUE), min_babies = min(`babies`,na.rm=TRUE), max_babies = max(`babies`, na.rm=TRUE)) #these commands tell us the max amount of nights stayed during the weel, the most amount of adults in one booking, and the maximum amount of babies in one booking.
# A tibble: 1 × 4
  Maximum_nights_stayed_weekend maximum_adults min_babies max_babies
                          <dbl>          <dbl>      <dbl>      <dbl>
1                            19             55          0         10
Code
summarize(h_bookings,sd_adults = sd(`adults`, na.rm = TRUE)) #this tells us the standard deviation of the amount of adults stayed for all bookings
# A tibble: 1 × 1
  sd_adults
      <dbl>
1     0.579
Code
filter(h_bookings, `arrival_date_month` == 'September', na.rm=TRUE) #will select rows that only contain the value "September"
# A tibble: 10,508 × 32
   hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
   <chr>    <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Resor…       1      38    2015 Septem…      36       1       0       2      2
 2 Resor…       1      47    2015 Septem…      36       1       0       2      2
 3 Resor…       0      48    2015 Septem…      36       1       0       3      2
 4 Resor…       0      48    2015 Septem…      36       1       0       3      2
 5 Resor…       0      27    2015 Septem…      36       1       0       3      2
 6 Resor…       1      71    2015 Septem…      36       1       0       3      2
 7 Resor…       1      85    2015 Septem…      36       1       0       4      2
 8 Resor…       1     132    2015 Septem…      36       1       0       4      2
 9 Resor…       0      50    2015 Septem…      36       1       0       4      2
10 Resor…       0     122    2015 Septem…      36       1       0       4      2
# … with 10,498 more rows, 22 more variables: children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
#   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
#   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …