Challenge 2

challenge_2

railroads

hotel_bookings

Author

Michaela Bowen

Published

October 11, 2022

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

railroad*.csv or StateCounty2012.xls ⭐
hotel_bookings.csv ⭐⭐⭐⭐

RailRoads
Hotel Bookings

Here I have read in the RailRoads Excel File Sheet. Extraneous and empty data has been removed, and Canada had its county named “Canada”.

Code

library(readxl)
railroads <-read_excel("_data/StateCounty2012.xls",
                     skip = 4,
                     col_names= c("state", "delete",  "county",
                                  "delete", "employees"))%>%
  select(!contains("delete"))%>%
  filter(!str_detect(state, "Total"))

railroads <-head(railroads, -2)%>%
  mutate(county = ifelse(state=="CANADA", "CANADA", county))

railroads

# A tibble: 2,931 × 3
   state county               employees
   <chr> <chr>                    <dbl>
 1 AE    APO                          2
 2 AK    ANCHORAGE                    7
 3 AK    FAIRBANKS NORTH STAR         2
 4 AK    JUNEAU                       3
 5 AK    MATANUSKA-SUSITNA            2
 6 AK    SITKA                        1
 7 AK    SKAGWAY MUNICIPALITY        88
 8 AL    AUTAUGA                    102
 9 AL    BALDWIN                    143
10 AL    BARBOUR                      1
# … with 2,921 more rows

Describe the data

The variables are named State, County, and Employees. This is likely the data of railroad employees working at counties within the United States. There are 54 distinct “States” which include the District of Columbia and Canada, the other two are likely overseas or military.

Code

railroads%>%
  summarise(across(c(state,county), n_distinct))

# A tibble: 1 × 2
  state county
  <int>  <int>
1    54   1710

Code

railroads%>%
  select(state)%>%
  unique()

# A tibble: 54 × 1
   state
   <chr>
 1 AE   
 2 AK   
 3 AL   
 4 AP   
 5 AR   
 6 AZ   
 7 CA   
 8 CO   
 9 CT   
10 DC   
# … with 44 more rows

Summarizing across both columns we can see there are 54 distinct “states” and 1710 “counties”, this means that there are multiple instances of counties throughout the United States, with some states having the same county names as others.

I was also able to determine the entire number of railroad employees across all stations. When grouping by state, we are able to determine how many employees worked at railroad stations by state when this data was collected. I have grouped the data by state, and made it so we can see the total, mean, median, max and min employees by state. We can see in Texas, the state with the largest number of railroad employees, has only 1 employee at the least populated station and 4,235 employees at the most poulated. The average amount of employees at a texas railroad station is about 90 people.

Code

#giving the number of distinct counties and states
railroads%>%
  summarise(across(c(state,county), n_distinct))

# A tibble: 1 × 2
  state county
  <int>  <int>
1    54   1710

Code

#getting the total employees
railroads%>%
  summarize(total_employees = sum(employees))

# A tibble: 1 × 1
  total_employees
            <dbl>
1          256094

Code

#sum, mean, min, and max employess for each state
railroads%>%
  group_by(state)%>%
  summarize(total_state_employees = sum(employees), mean_employees = mean(employees), min_employees = min(employees), max_employees= max(employees))%>%
  arrange(desc(total_state_employees))

# A tibble: 54 × 5
   state total_state_employees mean_employees min_employees max_employees
   <chr>                 <dbl>          <dbl>         <dbl>         <dbl>
 1 TX                    19839           89.8             1          4235
 2 IL                    19131          186.              1          8207
 3 NY                    17050          280.              5          3685
 4 NE                    13176          148.              1          3797
 5 CA                    13137          239.              1          2888
 6 PA                    12769          196.              3          1649
 7 OH                     9056          103.              3           842
 8 GA                     8605           56.6             1           878
 9 IN                     8537           92.8             3          1999
10 MO                     8419           73.2             1          2055
# … with 44 more rows

Here I am attempting to read in and glean any information from this DataSet for Practice. Below I have Read in the data, and put it into a summary to better understand the data

Code

library(readxl)
library(lubridate)
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
print(summarytools::dfSummary(hotel_bookings,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

hotel_bookings

Dimensions: 119390 x 32
Duplicates: 31994

Variable

Stats / Values

Freqs (% of Valid)

Graph

Missing

hotel [character]

1. City Hotel

2. Resort Hotel

79330	(	66.4%	)
40060	(	33.6%	)

0 (0.0%)

is_canceled [numeric]

Min : 0

Mean : 0.4

Max : 1

0	:	75166	(	63.0%	)
1	:	44224	(	37.0%	)

0 (0.0%)

lead_time [numeric]

Mean (sd) : 104 (106.9)

min ≤ med ≤ max:

0 ≤ 69 ≤ 737

IQR (CV) : 142 (1)

479 distinct values

0 (0.0%)

arrival_date_year [numeric]

Mean (sd) : 2016.2 (0.7)

min ≤ med ≤ max:

2015 ≤ 2016 ≤ 2017

IQR (CV) : 1 (0)

2015	:	21996	(	18.4%	)
2016	:	56707	(	47.5%	)
2017	:	40687	(	34.1%	)

0 (0.0%)

arrival_date_month [character]

1. August

2. July

3. May

4. October

5. April

6. June

7. September

8. March

9. February

10. November

[ 2 others ]

13877	(	11.6%	)
12661	(	10.6%	)
11791	(	9.9%	)
11160	(	9.3%	)
11089	(	9.3%	)
10939	(	9.2%	)
10508	(	8.8%	)
9794	(	8.2%	)
8068	(	6.8%	)
6794	(	5.7%	)
12709	(	10.6%	)

0 (0.0%)

arrival_date_week_number [numeric]

Mean (sd) : 27.2 (13.6)

min ≤ med ≤ max:

1 ≤ 28 ≤ 53

IQR (CV) : 22 (0.5)

53 distinct values

0 (0.0%)

arrival_date_day_of_month [numeric]

Mean (sd) : 15.8 (8.8)

min ≤ med ≤ max:

1 ≤ 16 ≤ 31

IQR (CV) : 15 (0.6)

31 distinct values

0 (0.0%)

stays_in_weekend_nights [numeric]

Mean (sd) : 0.9 (1)

min ≤ med ≤ max:

0 ≤ 1 ≤ 19

IQR (CV) : 2 (1.1)

17 distinct values

0 (0.0%)

stays_in_week_nights [numeric]

Mean (sd) : 2.5 (1.9)

min ≤ med ≤ max:

0 ≤ 2 ≤ 50

IQR (CV) : 2 (0.8)

35 distinct values

0 (0.0%)

adults [numeric]

Mean (sd) : 1.9 (0.6)

min ≤ med ≤ max:

0 ≤ 2 ≤ 55

IQR (CV) : 0 (0.3)

14 distinct values

0 (0.0%)

children [numeric]

Mean (sd) : 0.1 (0.4)

min ≤ med ≤ max:

0 ≤ 0 ≤ 10

IQR (CV) : 0 (3.8)

0	:	110796	(	92.8%	)
1	:	4861	(	4.1%	)
2	:	3652	(	3.1%	)
3	:	76	(	0.1%	)
10	:	1	(	0.0%	)

4 (0.0%)

babies [numeric]

Mean (sd) : 0 (0.1)

min ≤ med ≤ max:

0 ≤ 0 ≤ 10

IQR (CV) : 0 (12.3)

0	:	118473	(	99.2%	)
1	:	900	(	0.8%	)
2	:	15	(	0.0%	)
9	:	1	(	0.0%	)
10	:	1	(	0.0%	)

0 (0.0%)

meal [character]

1. BB

2. FB

3. HB

4. SC

5. Undefined

92310	(	77.3%	)
798	(	0.7%	)
14463	(	12.1%	)
10650	(	8.9%	)
1169	(	1.0%	)

0 (0.0%)

country [character]

1. PRT

2. GBR

3. FRA

4. ESP

5. DEU

6. ITA

7. IRL

8. BEL

9. BRA

10. NLD

[ 168 others ]

48590	(	40.7%	)
12129	(	10.2%	)
10415	(	8.7%	)
8568	(	7.2%	)
7287	(	6.1%	)
3766	(	3.2%	)
3375	(	2.8%	)
2342	(	2.0%	)
2224	(	1.9%	)
2104	(	1.8%	)
18590	(	15.6%	)

0 (0.0%)

market_segment [character]

1. Aviation

2. Complementary

3. Corporate

4. Direct

5. Groups

6. Offline TA/TO

7. Online TA

8. Undefined

237	(	0.2%	)
743	(	0.6%	)
5295	(	4.4%	)
12606	(	10.6%	)
19811	(	16.6%	)
24219	(	20.3%	)
56477	(	47.3%	)
2	(	0.0%	)

0 (0.0%)

distribution_channel [character]

1. Corporate

2. Direct

3. GDS

4. TA/TO

5. Undefined

6677	(	5.6%	)
14645	(	12.3%	)
193	(	0.2%	)
97870	(	82.0%	)
5	(	0.0%	)

0 (0.0%)

is_repeated_guest [numeric]

Min : 0

Mean : 0

Max : 1

0	:	115580	(	96.8%	)
1	:	3810	(	3.2%	)

0 (0.0%)

previous_cancellations [numeric]

Mean (sd) : 0.1 (0.8)

min ≤ med ≤ max:

0 ≤ 0 ≤ 26

IQR (CV) : 0 (9.7)

15 distinct values

0 (0.0%)

previous_bookings_not_canceled [numeric]

Mean (sd) : 0.1 (1.5)

min ≤ med ≤ max:

0 ≤ 0 ≤ 72

IQR (CV) : 0 (10.9)

73 distinct values

0 (0.0%)

reserved_room_type [character]

1. A

2. B

3. C

4. D

5. E

6. F

7. G

8. H

9. L

10. P

85994	(	72.0%	)
1118	(	0.9%	)
932	(	0.8%	)
19201	(	16.1%	)
6535	(	5.5%	)
2897	(	2.4%	)
2094	(	1.8%	)
601	(	0.5%	)
6	(	0.0%	)
12	(	0.0%	)

0 (0.0%)

assigned_room_type [character]

1. A

2. D

3. E

4. F

5. G

6. C

7. B

8. H

9. I

10. K

[ 2 others ]

74053	(	62.0%	)
25322	(	21.2%	)
7806	(	6.5%	)
3751	(	3.1%	)
2553	(	2.1%	)
2375	(	2.0%	)
2163	(	1.8%	)
712	(	0.6%	)
363	(	0.3%	)
279	(	0.2%	)
13	(	0.0%	)

0 (0.0%)

booking_changes [numeric]

Mean (sd) : 0.2 (0.7)

min ≤ med ≤ max:

0 ≤ 0 ≤ 21

IQR (CV) : 0 (2.9)

21 distinct values

0 (0.0%)

deposit_type [character]

1. No Deposit

2. Non Refund

3. Refundable

104641	(	87.6%	)
14587	(	12.2%	)
162	(	0.1%	)

0 (0.0%)

agent [character]

1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]

31961	(	26.8%	)
16340	(	13.7%	)
13922	(	11.7%	)
7191	(	6.0%	)
3640	(	3.0%	)
3539	(	3.0%	)
3290	(	2.8%	)
2870	(	2.4%	)
1721	(	1.4%	)
1666	(	1.4%	)
33250	(	27.8%	)

0 (0.0%)

company [character]

1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]

112593	(	94.3%	)
927	(	0.8%	)
784	(	0.7%	)
267	(	0.2%	)
250	(	0.2%	)
215	(	0.2%	)
149	(	0.1%	)
141	(	0.1%	)
138	(	0.1%	)
133	(	0.1%	)
3793	(	3.2%	)

0 (0.0%)

days_in_waiting_list [numeric]

Mean (sd) : 2.3 (17.6)

min ≤ med ≤ max:

0 ≤ 0 ≤ 391

IQR (CV) : 0 (7.6)

128 distinct values

0 (0.0%)

customer_type [character]

1. Contract

2. Group

3. Transient

4. Transient-Party

4076	(	3.4%	)
577	(	0.5%	)
89613	(	75.1%	)
25124	(	21.0%	)

0 (0.0%)

adr [numeric]

Mean (sd) : 101.8 (50.5)

min ≤ med ≤ max:

-6.4 ≤ 94.6 ≤ 5400

IQR (CV) : 56.7 (0.5)

8879 distinct values

0 (0.0%)

required_car_parking_spaces [numeric]

Mean (sd) : 0.1 (0.2)

min ≤ med ≤ max:

0 ≤ 0 ≤ 8

IQR (CV) : 0 (3.9)

0	:	111974	(	93.8%	)
1	:	7383	(	6.2%	)
2	:	28	(	0.0%	)
3	:	3	(	0.0%	)
8	:	2	(	0.0%	)

0 (0.0%)

total_of_special_requests [numeric]

Mean (sd) : 0.6 (0.8)

min ≤ med ≤ max:

0 ≤ 0 ≤ 5

IQR (CV) : 1 (1.4)

0	:	70318	(	58.9%	)
1	:	33226	(	27.8%	)
2	:	12969	(	10.9%	)
3	:	2497	(	2.1%	)
4	:	340	(	0.3%	)
5	:	40	(	0.0%	)

0 (0.0%)

reservation_status [character]

1. Canceled

2. Check-Out

3. No-Show

43017	(	36.0%	)
75166	(	63.0%	)
1207	(	1.0%	)

0 (0.0%)

reservation_status_date [Date]

min : 2014-10-17

med : 2016-08-07

max : 2017-09-14

range : 2y 10m 28d

926 distinct values

0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-10-12

Initially after looking throught this data set I thought there was a lot of infomration, but because this is an earlier challenge I wanted to keep it preliminary here and work more with this data set in challenge 4 with the lubridate package

I was able to determine that there are only two hotels in this data set the City Hotel and the Resort Hotel. Below I was able to determine the average lead time by month for each hotel

Code

#Resort Hotel Average lead time by month
hotel_bookings%>%
  group_by(arrival_date_month)%>%
  filter(is_canceled == 0, hotel == "Resort Hotel")%>%
  mutate(avg_lead_time_by_month = mean(lead_time))%>%
  select(hotel, arrival_date_month, avg_lead_time_by_month)%>%
  distinct()

# A tibble: 12 × 3
# Groups:   arrival_date_month [12]
   hotel        arrival_date_month avg_lead_time_by_month
   <chr>        <chr>                               <dbl>
 1 Resort Hotel July                                 96.2
 2 Resort Hotel August                               92.8
 3 Resort Hotel September                           146. 
 4 Resort Hotel October                             103. 
 5 Resort Hotel November                             32.3
 6 Resort Hotel December                             44.9
 7 Resort Hotel January                              19.7
 8 Resort Hotel February                             27.6
 9 Resort Hotel March                                52.4
10 Resort Hotel April                                73.6
11 Resort Hotel May                                 107. 
12 Resort Hotel June                                126.

Code

#City Hotel Average lead time by month
hotel_bookings%>%
  group_by(arrival_date_month)%>%
  filter(is_canceled == 0, hotel == "City Hotel")%>%
  mutate(avg_lead_time_by_month = mean(lead_time))%>%
  select(hotel, arrival_date_month, avg_lead_time_by_month)%>%
  distinct()

# A tibble: 12 × 3
# Groups:   arrival_date_month [12]
   hotel      arrival_date_month avg_lead_time_by_month
   <chr>      <chr>                               <dbl>
 1 City Hotel July                                133. 
 2 City Hotel August                              101. 
 3 City Hotel September                            81.8
 4 City Hotel October                              83.6
 5 City Hotel November                             53.7
 6 City Hotel December                             51.3
 7 City Hotel January                              32.4
 8 City Hotel February                             36.3
 9 City Hotel March                                53.4
10 City Hotel April                                75.9
11 City Hotel May                                  90.0
12 City Hotel June                                103.