DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 2

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Read in the Data

Challenge 2

  • Show All Code
  • Hide All Code

  • View Source
challenge_2
railroads
hotel_bookings
Author

Michaela Bowen

Published

October 11, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

  • railroad*.csv or StateCounty2012.xls ⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
  • RailRoads
  • Hotel Bookings

Here I have read in the RailRoads Excel File Sheet. Extraneous and empty data has been removed, and Canada had its county named “Canada”.

Code
library(readxl)
railroads <-read_excel("_data/StateCounty2012.xls",
                     skip = 4,
                     col_names= c("state", "delete",  "county",
                                  "delete", "employees"))%>%
  select(!contains("delete"))%>%
  filter(!str_detect(state, "Total"))

railroads <-head(railroads, -2)%>%
  mutate(county = ifelse(state=="CANADA", "CANADA", county))

railroads
# A tibble: 2,931 × 3
   state county               employees
   <chr> <chr>                    <dbl>
 1 AE    APO                          2
 2 AK    ANCHORAGE                    7
 3 AK    FAIRBANKS NORTH STAR         2
 4 AK    JUNEAU                       3
 5 AK    MATANUSKA-SUSITNA            2
 6 AK    SITKA                        1
 7 AK    SKAGWAY MUNICIPALITY        88
 8 AL    AUTAUGA                    102
 9 AL    BALDWIN                    143
10 AL    BARBOUR                      1
# … with 2,921 more rows

Describe the data

The variables are named State, County, and Employees. This is likely the data of railroad employees working at counties within the United States. There are 54 distinct “States” which include the District of Columbia and Canada, the other two are likely overseas or military.

Code
railroads%>%
  summarise(across(c(state,county), n_distinct))
# A tibble: 1 × 2
  state county
  <int>  <int>
1    54   1710
Code
railroads%>%
  select(state)%>%
  unique()
# A tibble: 54 × 1
   state
   <chr>
 1 AE   
 2 AK   
 3 AL   
 4 AP   
 5 AR   
 6 AZ   
 7 CA   
 8 CO   
 9 CT   
10 DC   
# … with 44 more rows

Summarizing across both columns we can see there are 54 distinct “states” and 1710 “counties”, this means that there are multiple instances of counties throughout the United States, with some states having the same county names as others.

I was also able to determine the entire number of railroad employees across all stations. When grouping by state, we are able to determine how many employees worked at railroad stations by state when this data was collected. I have grouped the data by state, and made it so we can see the total, mean, median, max and min employees by state. We can see in Texas, the state with the largest number of railroad employees, has only 1 employee at the least populated station and 4,235 employees at the most poulated. The average amount of employees at a texas railroad station is about 90 people.

Code
#giving the number of distinct counties and states
railroads%>%
  summarise(across(c(state,county), n_distinct))
# A tibble: 1 × 2
  state county
  <int>  <int>
1    54   1710
Code
#getting the total employees
railroads%>%
  summarize(total_employees = sum(employees))
# A tibble: 1 × 1
  total_employees
            <dbl>
1          256094
Code
#sum, mean, min, and max employess for each state
railroads%>%
  group_by(state)%>%
  summarize(total_state_employees = sum(employees), mean_employees = mean(employees), min_employees = min(employees), max_employees= max(employees))%>%
  arrange(desc(total_state_employees))
# A tibble: 54 × 5
   state total_state_employees mean_employees min_employees max_employees
   <chr>                 <dbl>          <dbl>         <dbl>         <dbl>
 1 TX                    19839           89.8             1          4235
 2 IL                    19131          186.              1          8207
 3 NY                    17050          280.              5          3685
 4 NE                    13176          148.              1          3797
 5 CA                    13137          239.              1          2888
 6 PA                    12769          196.              3          1649
 7 OH                     9056          103.              3           842
 8 GA                     8605           56.6             1           878
 9 IN                     8537           92.8             3          1999
10 MO                     8419           73.2             1          2055
# … with 44 more rows

Here I am attempting to read in and glean any information from this DataSet for Practice. Below I have Read in the data, and put it into a summary to better understand the data

Code
library(readxl)
library(lubridate)
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
print(summarytools::dfSummary(hotel_bookings,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

hotel_bookings

Dimensions: 119390 x 32
Duplicates: 31994
Variable Stats / Values Freqs (% of Valid) Graph Missing
hotel [character]
1. City Hotel
2. Resort Hotel
79330(66.4%)
40060(33.6%)
0 (0.0%)
is_canceled [numeric]
Min : 0
Mean : 0.4
Max : 1
0:75166(63.0%)
1:44224(37.0%)
0 (0.0%)
lead_time [numeric]
Mean (sd) : 104 (106.9)
min ≤ med ≤ max:
0 ≤ 69 ≤ 737
IQR (CV) : 142 (1)
479 distinct values 0 (0.0%)
arrival_date_year [numeric]
Mean (sd) : 2016.2 (0.7)
min ≤ med ≤ max:
2015 ≤ 2016 ≤ 2017
IQR (CV) : 1 (0)
2015:21996(18.4%)
2016:56707(47.5%)
2017:40687(34.1%)
0 (0.0%)
arrival_date_month [character]
1. August
2. July
3. May
4. October
5. April
6. June
7. September
8. March
9. February
10. November
[ 2 others ]
13877(11.6%)
12661(10.6%)
11791(9.9%)
11160(9.3%)
11089(9.3%)
10939(9.2%)
10508(8.8%)
9794(8.2%)
8068(6.8%)
6794(5.7%)
12709(10.6%)
0 (0.0%)
arrival_date_week_number [numeric]
Mean (sd) : 27.2 (13.6)
min ≤ med ≤ max:
1 ≤ 28 ≤ 53
IQR (CV) : 22 (0.5)
53 distinct values 0 (0.0%)
arrival_date_day_of_month [numeric]
Mean (sd) : 15.8 (8.8)
min ≤ med ≤ max:
1 ≤ 16 ≤ 31
IQR (CV) : 15 (0.6)
31 distinct values 0 (0.0%)
stays_in_weekend_nights [numeric]
Mean (sd) : 0.9 (1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 19
IQR (CV) : 2 (1.1)
17 distinct values 0 (0.0%)
stays_in_week_nights [numeric]
Mean (sd) : 2.5 (1.9)
min ≤ med ≤ max:
0 ≤ 2 ≤ 50
IQR (CV) : 2 (0.8)
35 distinct values 0 (0.0%)
adults [numeric]
Mean (sd) : 1.9 (0.6)
min ≤ med ≤ max:
0 ≤ 2 ≤ 55
IQR (CV) : 0 (0.3)
14 distinct values 0 (0.0%)
children [numeric]
Mean (sd) : 0.1 (0.4)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (3.8)
0:110796(92.8%)
1:4861(4.1%)
2:3652(3.1%)
3:76(0.1%)
10:1(0.0%)
4 (0.0%)
babies [numeric]
Mean (sd) : 0 (0.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 10
IQR (CV) : 0 (12.3)
0:118473(99.2%)
1:900(0.8%)
2:15(0.0%)
9:1(0.0%)
10:1(0.0%)
0 (0.0%)
meal [character]
1. BB
2. FB
3. HB
4. SC
5. Undefined
92310(77.3%)
798(0.7%)
14463(12.1%)
10650(8.9%)
1169(1.0%)
0 (0.0%)
country [character]
1. PRT
2. GBR
3. FRA
4. ESP
5. DEU
6. ITA
7. IRL
8. BEL
9. BRA
10. NLD
[ 168 others ]
48590(40.7%)
12129(10.2%)
10415(8.7%)
8568(7.2%)
7287(6.1%)
3766(3.2%)
3375(2.8%)
2342(2.0%)
2224(1.9%)
2104(1.8%)
18590(15.6%)
0 (0.0%)
market_segment [character]
1. Aviation
2. Complementary
3. Corporate
4. Direct
5. Groups
6. Offline TA/TO
7. Online TA
8. Undefined
237(0.2%)
743(0.6%)
5295(4.4%)
12606(10.6%)
19811(16.6%)
24219(20.3%)
56477(47.3%)
2(0.0%)
0 (0.0%)
distribution_channel [character]
1. Corporate
2. Direct
3. GDS
4. TA/TO
5. Undefined
6677(5.6%)
14645(12.3%)
193(0.2%)
97870(82.0%)
5(0.0%)
0 (0.0%)
is_repeated_guest [numeric]
Min : 0
Mean : 0
Max : 1
0:115580(96.8%)
1:3810(3.2%)
0 (0.0%)
previous_cancellations [numeric]
Mean (sd) : 0.1 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 26
IQR (CV) : 0 (9.7)
15 distinct values 0 (0.0%)
previous_bookings_not_canceled [numeric]
Mean (sd) : 0.1 (1.5)
min ≤ med ≤ max:
0 ≤ 0 ≤ 72
IQR (CV) : 0 (10.9)
73 distinct values 0 (0.0%)
reserved_room_type [character]
1. A
2. B
3. C
4. D
5. E
6. F
7. G
8. H
9. L
10. P
85994(72.0%)
1118(0.9%)
932(0.8%)
19201(16.1%)
6535(5.5%)
2897(2.4%)
2094(1.8%)
601(0.5%)
6(0.0%)
12(0.0%)
0 (0.0%)
assigned_room_type [character]
1. A
2. D
3. E
4. F
5. G
6. C
7. B
8. H
9. I
10. K
[ 2 others ]
74053(62.0%)
25322(21.2%)
7806(6.5%)
3751(3.1%)
2553(2.1%)
2375(2.0%)
2163(1.8%)
712(0.6%)
363(0.3%)
279(0.2%)
13(0.0%)
0 (0.0%)
booking_changes [numeric]
Mean (sd) : 0.2 (0.7)
min ≤ med ≤ max:
0 ≤ 0 ≤ 21
IQR (CV) : 0 (2.9)
21 distinct values 0 (0.0%)
deposit_type [character]
1. No Deposit
2. Non Refund
3. Refundable
104641(87.6%)
14587(12.2%)
162(0.1%)
0 (0.0%)
agent [character]
1. 9
2. NULL
3. 240
4. 1
5. 14
6. 7
7. 6
8. 250
9. 241
10. 28
[ 324 others ]
31961(26.8%)
16340(13.7%)
13922(11.7%)
7191(6.0%)
3640(3.0%)
3539(3.0%)
3290(2.8%)
2870(2.4%)
1721(1.4%)
1666(1.4%)
33250(27.8%)
0 (0.0%)
company [character]
1. NULL
2. 40
3. 223
4. 67
5. 45
6. 153
7. 174
8. 219
9. 281
10. 154
[ 343 others ]
112593(94.3%)
927(0.8%)
784(0.7%)
267(0.2%)
250(0.2%)
215(0.2%)
149(0.1%)
141(0.1%)
138(0.1%)
133(0.1%)
3793(3.2%)
0 (0.0%)
days_in_waiting_list [numeric]
Mean (sd) : 2.3 (17.6)
min ≤ med ≤ max:
0 ≤ 0 ≤ 391
IQR (CV) : 0 (7.6)
128 distinct values 0 (0.0%)
customer_type [character]
1. Contract
2. Group
3. Transient
4. Transient-Party
4076(3.4%)
577(0.5%)
89613(75.1%)
25124(21.0%)
0 (0.0%)
adr [numeric]
Mean (sd) : 101.8 (50.5)
min ≤ med ≤ max:
-6.4 ≤ 94.6 ≤ 5400
IQR (CV) : 56.7 (0.5)
8879 distinct values 0 (0.0%)
required_car_parking_spaces [numeric]
Mean (sd) : 0.1 (0.2)
min ≤ med ≤ max:
0 ≤ 0 ≤ 8
IQR (CV) : 0 (3.9)
0:111974(93.8%)
1:7383(6.2%)
2:28(0.0%)
3:3(0.0%)
8:2(0.0%)
0 (0.0%)
total_of_special_requests [numeric]
Mean (sd) : 0.6 (0.8)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5
IQR (CV) : 1 (1.4)
0:70318(58.9%)
1:33226(27.8%)
2:12969(10.9%)
3:2497(2.1%)
4:340(0.3%)
5:40(0.0%)
0 (0.0%)
reservation_status [character]
1. Canceled
2. Check-Out
3. No-Show
43017(36.0%)
75166(63.0%)
1207(1.0%)
0 (0.0%)
reservation_status_date [Date]
min : 2014-10-17
med : 2016-08-07
max : 2017-09-14
range : 2y 10m 28d
926 distinct values 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-10-12

Initially after looking throught this data set I thought there was a lot of infomration, but because this is an earlier challenge I wanted to keep it preliminary here and work more with this data set in challenge 4 with the lubridate package

I was able to determine that there are only two hotels in this data set the City Hotel and the Resort Hotel. Below I was able to determine the average lead time by month for each hotel

Code
#Resort Hotel Average lead time by month
hotel_bookings%>%
  group_by(arrival_date_month)%>%
  filter(is_canceled == 0, hotel == "Resort Hotel")%>%
  mutate(avg_lead_time_by_month = mean(lead_time))%>%
  select(hotel, arrival_date_month, avg_lead_time_by_month)%>%
  distinct()
# A tibble: 12 × 3
# Groups:   arrival_date_month [12]
   hotel        arrival_date_month avg_lead_time_by_month
   <chr>        <chr>                               <dbl>
 1 Resort Hotel July                                 96.2
 2 Resort Hotel August                               92.8
 3 Resort Hotel September                           146. 
 4 Resort Hotel October                             103. 
 5 Resort Hotel November                             32.3
 6 Resort Hotel December                             44.9
 7 Resort Hotel January                              19.7
 8 Resort Hotel February                             27.6
 9 Resort Hotel March                                52.4
10 Resort Hotel April                                73.6
11 Resort Hotel May                                 107. 
12 Resort Hotel June                                126. 
Code
#City Hotel Average lead time by month
hotel_bookings%>%
  group_by(arrival_date_month)%>%
  filter(is_canceled == 0, hotel == "City Hotel")%>%
  mutate(avg_lead_time_by_month = mean(lead_time))%>%
  select(hotel, arrival_date_month, avg_lead_time_by_month)%>%
  distinct()
# A tibble: 12 × 3
# Groups:   arrival_date_month [12]
   hotel      arrival_date_month avg_lead_time_by_month
   <chr>      <chr>                               <dbl>
 1 City Hotel July                                133. 
 2 City Hotel August                              101. 
 3 City Hotel September                            81.8
 4 City Hotel October                              83.6
 5 City Hotel November                             53.7
 6 City Hotel December                             51.3
 7 City Hotel January                              32.4
 8 City Hotel February                             36.3
 9 City Hotel March                                53.4
10 City Hotel April                                75.9
11 City Hotel May                                  90.0
12 City Hotel June                                103. 
Source Code
---
title: "Challenge 2"
author: "Michaela Bowen"
desription: "Data wrangling: using group() and summarise()"
date: "10/11/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - hotel_bookings
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```


## Read in the Data

-   railroad\*.csv or StateCounty2012.xls ⭐
-   hotel_bookings.csv ⭐⭐⭐⭐

:::panel-tabset
# RailRoads
Here I have read in the RailRoads Excel File Sheet. Extraneous and empty data has been removed, and Canada had its county named "Canada". 


```{r}
library(readxl)
railroads <-read_excel("_data/StateCounty2012.xls",
                     skip = 4,
                     col_names= c("state", "delete",  "county",
                                  "delete", "employees"))%>%
  select(!contains("delete"))%>%
  filter(!str_detect(state, "Total"))

railroads <-head(railroads, -2)%>%
  mutate(county = ifelse(state=="CANADA", "CANADA", county))

railroads
```

## Describe the data

The variables are named *State*, *County*, and *Employees*. This is likely the data of railroad employees working at counties within the United States. There are 54 distinct "States" which include the District of Columbia and Canada, the other two are likely overseas or military. 

```{r}
#| label: summary
railroads%>%
  summarise(across(c(state,county), n_distinct))


railroads%>%
  select(state)%>%
  unique()

```


Summarizing across both columns we can see there are 54 distinct "states" and 1710 "counties", this means that there are multiple instances of counties throughout the United States, with some states having the same county names as others.

I was also able to determine the entire number of railroad employees across all stations. When grouping by state, we are able to determine how many employees worked at railroad stations by state when this data was collected. I have grouped the data by state, and made it so we can see the total, mean, median, max and min employees by state. We can see in Texas, the state with the largest number of railroad employees, has only 1 employee at the least populated station and 4,235 employees at the most poulated. The average amount of employees at a texas railroad station is about 90 people. 

```{r}
#giving the number of distinct counties and states
railroads%>%
  summarise(across(c(state,county), n_distinct))
#getting the total employees
railroads%>%
  summarize(total_employees = sum(employees))

            
#sum, mean, min, and max employess for each state
railroads%>%
  group_by(state)%>%
  summarize(total_state_employees = sum(employees), mean_employees = mean(employees), min_employees = min(employees), max_employees= max(employees))%>%
  arrange(desc(total_state_employees))

```
# Hotel Bookings

Here I am attempting to read in and glean any information from this DataSet for Practice. Below I have Read in the data, and put it into a summary to better understand the data
```{r}
library(readxl)
library(lubridate)
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
print(summarytools::dfSummary(hotel_bookings,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')
```
Initially after looking throught this data set I thought there was a lot of infomration, but because this is an earlier challenge I wanted to keep it preliminary here and work more with this data set in challenge 4 with the lubridate package

I was able to determine that there are only two hotels in this data set the City Hotel and the Resort Hotel.
Below I was able to determine the average lead time by month for each hotel
```{r}
#Resort Hotel Average lead time by month
hotel_bookings%>%
  group_by(arrival_date_month)%>%
  filter(is_canceled == 0, hotel == "Resort Hotel")%>%
  mutate(avg_lead_time_by_month = mean(lead_time))%>%
  select(hotel, arrival_date_month, avg_lead_time_by_month)%>%
  distinct()

#City Hotel Average lead time by month
hotel_bookings%>%
  group_by(arrival_date_month)%>%
  filter(is_canceled == 0, hotel == "City Hotel")%>%
  mutate(avg_lead_time_by_month = mean(lead_time))%>%
  select(hotel, arrival_date_month, avg_lead_time_by_month)%>%
  distinct()
```


:::