Today’s challenge is to (1) read in a data set and describe it using both words and supporting information (e.g., tables), and (2) provide summary statistics for interesting groups within the data and interpret those statistics.
reservation_status takes three distinct values: Check-Out, Canceled, and No-Show. is_canceled is a binary indicator coded as 0 or 1, and arrival_date_year has three distinct values (2015, 2016, and 2017).
Code
table(hotel_bookings$arrival_date_month)
    April    August  December  February   January      July      June     March
    11089     13877      6780      8068      5929     12661     10939      9794
      May  November   October September
    11791      6794     11160     10508
August has the highest number of bookings (13,877), followed by July (12,661) and May (11,791), which suggests that summer is the peak booking season.
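Because table() sorts the months alphabetically, the peak is easy to misread. The sketch below reorders the counts by calendar month and picks out the busiest months. The counts are copied from the output above; the vector name month_counts is only illustrative.

```r
# Counts copied from the table() output above
month_counts <- c(
  April = 11089, August = 13877, December = 6780, February = 8068,
  January = 5929, July = 12661, June = 10939, March = 9794,
  May = 11791, November = 6794, October = 11160, September = 10508
)

# Reorder by calendar month using R's built-in month.name vector
month_counts <- month_counts[month.name]

# Busiest month, then the three highest counts
busiest <- names(which.max(month_counts))
top_three <- sort(month_counts, decreasing = TRUE)[1:3]

print(busiest)
print(top_three)
```

On the full data frame, the same ordering could be obtained by converting the column with factor(arrival_date_month, levels = month.name) before calling table().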
Code
table(hotel_bookings$arrival_date_year)
 2015  2016  2017
21996 56707 40687
Most bookings fall in 2016 (56,707), compared with 21,996 in 2015 and 40,687 in 2017.
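To compare the years on a common scale, the counts can be turned into percentage shares. This is a small sketch using the counts from the output above; year_counts is an illustrative name.

```r
# Counts copied from the table() output above
year_counts <- c(`2015` = 21996, `2016` = 56707, `2017` = 40687)

# Share of all bookings per arrival year, in percent
year_share <- round(100 * prop.table(year_counts), 1)
print(year_share)
```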
Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
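As a minimal sketch of what such a grouped summary could look like, the example below computes central tendency and dispersion per hotel with group_by() and summarise(). It runs on a small made-up data frame rather than the real file; the numeric column lead_time and the helper stat_mode() are assumptions for illustration (base R has no built-in statistical mode).

```r
library(dplyr)

# Toy stand-in for hotel_bookings; all values are made up for illustration
toy <- tibble(
  hotel     = c("City Hotel", "City Hotel", "City Hotel",
                "Resort Hotel", "Resort Hotel", "Resort Hotel"),
  lead_time = c(10, 40, 40, 5, 5, 80)  # hypothetical numeric column
)

# Hypothetical helper: most frequent value (first one wins on ties)
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

stats <- toy %>%
  group_by(hotel) %>%
  summarise(
    mean   = mean(lead_time),
    median = median(lead_time),
    mode   = stat_mode(lead_time),
    sd     = sd(lead_time),
    min    = min(lead_time),
    max    = max(lead_time),
    q25    = quantile(lead_time, 0.25, names = FALSE),
    q75    = quantile(lead_time, 0.75, names = FALSE)
  )

print(stats)
```

Unlike calling summary() on a grouped data frame, which ignores the grouping, summarise() returns one row per group.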
Generated by summarytools 1.0.1 (R version 4.2.1) 2022-12-21
Code
a <- hotel_bookings %>%
  group_by(hotel) %>%
  count(reserved_room_type) %>%
  pivot_wider(names_from = hotel, values_from = n)
a
# A tibble: 10 × 3
reserved_room_type `City Hotel` `Resort Hotel`
<chr> <int> <int>
1 A 62595 23399
2 B 1115 3
3 C 14 918
4 D 11768 7433
5 E 1553 4982
6 F 1791 1106
7 G 484 1610
8 P 10 2
9 H NA 601
10 L NA 6
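The NA cells in the table above come from hotel and room type combinations that never occur, not from missing values in the raw data. One way to make that explicit, sketched here on a handful of counts copied from the table, is the values_fill argument of pivot_wider(); the object names counts and wide are illustrative.

```r
library(dplyr)
library(tidyr)

# A few (hotel, room type, n) counts copied from the table above
counts <- tribble(
  ~hotel,         ~reserved_room_type, ~n,
  "City Hotel",   "A",                 62595,
  "Resort Hotel", "A",                 23399,
  "City Hotel",   "P",                 10,
  "Resort Hotel", "P",                 2,
  "Resort Hotel", "H",                 601,
  "Resort Hotel", "L",                 6
)

# values_fill = 0 writes 0 instead of NA for combinations with no rows
wide <- counts %>%
  pivot_wider(names_from = hotel, values_from = n, values_fill = 0)

print(wide)
```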
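Room type A is the most frequently reserved type in both the City Hotel (62,595) and the Resort Hotel (23,399).
Among the room types reserved at both hotels, P is the least common (10 in the City Hotel and 2 in the Resort Hotel); overall, type L has the fewest reservations (6, all in the Resort Hotel).
Room types H and L do not appear for the City Hotel (shown as NA); they are reserved only in the Resort Hotel.
For these variables there are no missing values in the data, although the data set does contain duplicate rows.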
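Statements about missing values and duplicates can be checked directly. The sketch below shows one way to do so with base R on a small made-up data frame (it deliberately contains one NA and one repeated row so that both checks return something); on the real data, toy would be replaced by hotel_bookings.

```r
# Toy data frame standing in for hotel_bookings; values are made up
toy <- data.frame(
  hotel = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  reserved_room_type = c("A", "A", "H", NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(toy))

print(na_per_column)
print(n_duplicates)
```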
Code
a <- hotel_bookings %>%
  group_by(hotel) %>%
  count(is_canceled) %>%
  pivot_wider(names_from = hotel, values_from = n)
a
Here we compare cancellations after booking for the two hotel types, Resort Hotel and City Hotel.
The share of bookings that were canceled (including no-shows) is higher for the City Hotel, at about 41.7%, than for the Resort Hotel, at about 27.7%.
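Because is_canceled is coded 0/1, its mean within each group is the cancellation rate, so the percentages can be computed directly instead of being derived by hand from the counts. This is a sketch on made-up rows, not the real data; the object names toy and rates are illustrative.

```r
library(dplyr)

# Toy stand-in for hotel_bookings; values are made up for illustration
toy <- tibble(
  hotel       = c(rep("City Hotel", 5), rep("Resort Hotel", 4)),
  is_canceled = c(1, 1, 0, 0, 0, 1, 0, 0, 0)
)

rates <- toy %>%
  group_by(hotel) %>%
  summarise(
    bookings    = n(),               # rows per hotel
    canceled    = sum(is_canceled),  # number of canceled bookings
    cancel_rate = mean(is_canceled)  # share canceled, between 0 and 1
  )

print(rates)
```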
Source Code
---
title: "Challenge 2 Instructions"
author: "Tejaswini_Ketineni"
description: "Reading in data and creating a post"
date: "08/21/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_1
  - railroads
  - faostat
  - wildbirds
---

## Challenge 2

## Importing necessary libraries

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

## Challenge 2 Overview

Today's challenge is to

1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics

## Read in the Data

I am working with the hotel_bookings data set.

```{r}
library(readr)
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
```

## Describe the data

First, view the first few records of the data.

```{r}
head(hotel_bookings)
```

Identify the column names in hotel_bookings.

```{r}
colnames(hotel_bookings)
```

Check the dimensions of hotel_bookings.

```{r}
dim(hotel_bookings)
```

- It consists of 119390 rows and 32 columns.

The describe function gives information about the mean, standard deviation, median, and many other metrics for each column in the data set.

```{r}
describe(hotel_bookings)
```

A summary of the entire hotel_bookings data set:

```{r}
summary(hotel_bookings)
```

Distinct values present in selected columns:

```{r}
distinct(hotel_bookings, hotel)
```

The hotel column has two distinct values: Resort Hotel and City Hotel.

```{r}
distinct(hotel_bookings, reservation_status)
distinct(hotel_bookings, is_canceled)
distinct(hotel_bookings, arrival_date_year)
```

reservation_status takes three distinct values: Check-Out, Canceled, and No-Show. is_canceled is a binary indicator coded as 0 or 1, and arrival_date_year has three distinct values (2015, 2016, and 2017).

```{r}
table(hotel_bookings$arrival_date_month)
```

August has the highest number of bookings (13,877), followed by July (12,661) and May (11,791), which suggests that summer is the peak booking season.

```{r}
table(hotel_bookings$arrival_date_year)
```

Most bookings fall in 2016 (56,707), compared with 21,996 in 2015 and 40,687 in 2017.

## Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.

```{r}
hotel_bookings %>%
  dplyr::group_by(hotel) %>%
  summary()
```

```{r}
# group_by() takes bare column names; a quoted string would group by a constant
a <- hotel_bookings %>% group_by(reservation_status)
summary(a)
```

```{r}
a <- hotel_bookings %>% group_by(is_canceled)
summary(a)
```

```{r}
print(
  summarytools::dfSummary(
    hotel_bookings,
    varnumbers = FALSE,
    plain.ascii = FALSE,
    style = "grid",
    graph.magnif = 0.70,
    valid.col = FALSE
  ),
  method = 'render',
  table.classes = 'table-condensed'
)
```

```{r}
a <- hotel_bookings %>%
  group_by(hotel) %>%
  count(reserved_room_type) %>%
  pivot_wider(names_from = hotel, values_from = n)
a
```

Room type A is the most frequently reserved type in both the City Hotel (62,595) and the Resort Hotel (23,399).
Among the room types reserved at both hotels, P is the least common (10 in the City Hotel and 2 in the Resort Hotel); overall, type L has the fewest reservations (6, all in the Resort Hotel).
Room types H and L do not appear for the City Hotel (shown as NA); they are reserved only in the Resort Hotel.
For these variables there are no missing values in the data, although the data set does contain duplicate rows.

```{r}
a <- hotel_bookings %>%
  group_by(hotel) %>%
  count(is_canceled) %>%
  pivot_wider(names_from = hotel, values_from = n)
a
```

Here we compare cancellations after booking for the two hotel types, Resort Hotel and City Hotel.
The share of bookings that were canceled (including no-shows) is higher for the City Hotel, at about 41.7%, than for the Resort Hotel, at about 27.7%.