Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Michaela Bowen
October 11, 2022
Here I have read in the RailRoads Excel File Sheet. Extraneous and empty data has been removed, and Canada had its county named “Canada”.
library(readxl)
railroads <-read_excel("_data/StateCounty2012.xls",
skip = 4,
col_names= c("state", "delete", "county",
"delete", "employees"))%>%
select(!contains("delete"))%>%
filter(!str_detect(state, "Total"))
railroads <-head(railroads, -2)%>%
mutate(county = ifelse(state=="CANADA", "CANADA", county))
railroads
# A tibble: 2,931 × 3
state county employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
# … with 2,921 more rows
The variables are named State, County, and Employees. This is likely the data of railroad employees working at counties within the United States. There are 54 distinct “States” which include the District of Columbia and Canada, the other two are likely overseas or military.
# A tibble: 1 × 2
state county
<int> <int>
1 54 1710
# A tibble: 54 × 1
state
<chr>
1 AE
2 AK
3 AL
4 AP
5 AR
6 AZ
7 CA
8 CO
9 CT
10 DC
# … with 44 more rows
Summarizing across both columns we can see there are 54 distinct “states” and 1710 “counties”, this means that there are multiple instances of counties throughout the United States, with some states having the same county names as others.
I was also able to determine the entire number of railroad employees across all stations. When grouping by state, we are able to determine how many employees worked at railroad stations by state when this data was collected. I have grouped the data by state, and made it so we can see the total, mean, median, max and min employees by state. We can see in Texas, the state with the largest number of railroad employees, has only 1 employee at the least populated station and 4,235 employees at the most poulated. The average amount of employees at a texas railroad station is about 90 people.
# A tibble: 1 × 2
state county
<int> <int>
1 54 1710
# A tibble: 1 × 1
total_employees
<dbl>
1 256094
# A tibble: 54 × 5
state total_state_employees mean_employees min_employees max_employees
<chr> <dbl> <dbl> <dbl> <dbl>
1 TX 19839 89.8 1 4235
2 IL 19131 186. 1 8207
3 NY 17050 280. 5 3685
4 NE 13176 148. 1 3797
5 CA 13137 239. 1 2888
6 PA 12769 196. 3 1649
7 OH 9056 103. 3 842
8 GA 8605 56.6 1 878
9 IN 8537 92.8 3 1999
10 MO 8419 73.2 1 2055
# … with 44 more rows
Here I am attempting to read in and glean any information from this DataSet for Practice. Below I have Read in the data, and put it into a summary to better understand the data
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
hotel [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
is_canceled [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
lead_time [numeric] |
|
479 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
arrival_date_year [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
arrival_date_month [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
arrival_date_week_number [numeric] |
|
53 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
arrival_date_day_of_month [numeric] |
|
31 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
stays_in_weekend_nights [numeric] |
|
17 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
stays_in_week_nights [numeric] |
|
35 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
adults [numeric] |
|
14 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
children [numeric] |
|
|
4 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
babies [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
meal [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
country [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
market_segment [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
distribution_channel [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
is_repeated_guest [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
previous_cancellations [numeric] |
|
15 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
previous_bookings_not_canceled [numeric] |
|
73 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
reserved_room_type [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
assigned_room_type [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
booking_changes [numeric] |
|
21 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
deposit_type [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
agent [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
company [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
days_in_waiting_list [numeric] |
|
128 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
customer_type [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
adr [numeric] |
|
8879 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
required_car_parking_spaces [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
total_of_special_requests [numeric] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
reservation_status [character] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
reservation_status_date [Date] |
|
926 distinct values | 0 (0.0%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-10-12
Initially after looking throught this data set I thought there was a lot of infomration, but because this is an earlier challenge I wanted to keep it preliminary here and work more with this data set in challenge 4 with the lubridate package
I was able to determine that there are only two hotels in this data set the City Hotel and the Resort Hotel. Below I was able to determine the average lead time by month for each hotel
# A tibble: 12 × 3
# Groups: arrival_date_month [12]
hotel arrival_date_month avg_lead_time_by_month
<chr> <chr> <dbl>
1 Resort Hotel July 96.2
2 Resort Hotel August 92.8
3 Resort Hotel September 146.
4 Resort Hotel October 103.
5 Resort Hotel November 32.3
6 Resort Hotel December 44.9
7 Resort Hotel January 19.7
8 Resort Hotel February 27.6
9 Resort Hotel March 52.4
10 Resort Hotel April 73.6
11 Resort Hotel May 107.
12 Resort Hotel June 126.
# A tibble: 12 × 3
# Groups: arrival_date_month [12]
hotel arrival_date_month avg_lead_time_by_month
<chr> <chr> <dbl>
1 City Hotel July 133.
2 City Hotel August 101.
3 City Hotel September 81.8
4 City Hotel October 83.6
5 City Hotel November 53.7
6 City Hotel December 51.3
7 City Hotel January 32.4
8 City Hotel February 36.3
9 City Hotel March 53.4
10 City Hotel April 75.9
11 City Hotel May 90.0
12 City Hotel June 103.
---
title: "Challenge 2"
author: "Michaela Bowen"
desription: "Data wrangling: using group() and summarise()"
date: "10/11/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- railroads
- hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Data
- railroad\*.csv or StateCounty2012.xls ⭐
- hotel_bookings.csv ⭐⭐⭐⭐
:::panel-tabset
# RailRoads
Here I have read in the RailRoads Excel File Sheet. Extraneous and empty data has been removed, and Canada had its county named "Canada".
```{r}
library(readxl)
railroads <-read_excel("_data/StateCounty2012.xls",
skip = 4,
col_names= c("state", "delete", "county",
"delete", "employees"))%>%
select(!contains("delete"))%>%
filter(!str_detect(state, "Total"))
railroads <-head(railroads, -2)%>%
mutate(county = ifelse(state=="CANADA", "CANADA", county))
railroads
```
## Describe the data
The variables are named *State*, *County*, and *Employees*. This is likely the data of railroad employees working at counties within the United States. There are 54 distinct "States" which include the District of Columbia and Canada, the other two are likely overseas or military.
```{r}
#| label: summary
railroads%>%
summarise(across(c(state,county), n_distinct))
railroads%>%
select(state)%>%
unique()
```
Summarizing across both columns we can see there are 54 distinct "states" and 1710 "counties", this means that there are multiple instances of counties throughout the United States, with some states having the same county names as others.
I was also able to determine the entire number of railroad employees across all stations. When grouping by state, we are able to determine how many employees worked at railroad stations by state when this data was collected. I have grouped the data by state, and made it so we can see the total, mean, median, max and min employees by state. We can see in Texas, the state with the largest number of railroad employees, has only 1 employee at the least populated station and 4,235 employees at the most poulated. The average amount of employees at a texas railroad station is about 90 people.
```{r}
#giving the number of distinct counties and states
railroads%>%
summarise(across(c(state,county), n_distinct))
#getting the total employees
railroads%>%
summarize(total_employees = sum(employees))
#sum, mean, min, and max employess for each state
railroads%>%
group_by(state)%>%
summarize(total_state_employees = sum(employees), mean_employees = mean(employees), min_employees = min(employees), max_employees= max(employees))%>%
arrange(desc(total_state_employees))
```
# Hotel Bookings
Here I am attempting to read in and glean any information from this DataSet for Practice. Below I have Read in the data, and put it into a summary to better understand the data
```{r}
library(readxl)
library(lubridate)
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
print(summarytools::dfSummary(hotel_bookings,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
```
Initially after looking throught this data set I thought there was a lot of infomration, but because this is an earlier challenge I wanted to keep it preliminary here and work more with this data set in challenge 4 with the lubridate package
I was able to determine that there are only two hotels in this data set the City Hotel and the Resort Hotel.
Below I was able to determine the average lead time by month for each hotel
```{r}
#Resort Hotel Average lead time by month
hotel_bookings%>%
group_by(arrival_date_month)%>%
filter(is_canceled == 0, hotel == "Resort Hotel")%>%
mutate(avg_lead_time_by_month = mean(lead_time))%>%
select(hotel, arrival_date_month, avg_lead_time_by_month)%>%
distinct()
#City Hotel Average lead time by month
hotel_bookings%>%
group_by(arrival_date_month)%>%
filter(is_canceled == 0, hotel == "City Hotel")%>%
mutate(avg_lead_time_by_month = mean(lead_time))%>%
select(hotel, arrival_date_month, avg_lead_time_by_month)%>%
distinct()
```
:::