Code
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)Michaela Bowen
October 11, 2022
Here I have read in the RailRoads Excel File Sheet. Extraneous and empty data has been removed, and Canada had its county named “Canada”.
library(readxl)
railroads <-read_excel("_data/StateCounty2012.xls",
                     skip = 4,
                     col_names= c("state", "delete",  "county",
                                  "delete", "employees"))%>%
  select(!contains("delete"))%>%
  filter(!str_detect(state, "Total"))
railroads <-head(railroads, -2)%>%
  mutate(county = ifelse(state=="CANADA", "CANADA", county))
railroads# A tibble: 2,931 × 3
   state county               employees
   <chr> <chr>                    <dbl>
 1 AE    APO                          2
 2 AK    ANCHORAGE                    7
 3 AK    FAIRBANKS NORTH STAR         2
 4 AK    JUNEAU                       3
 5 AK    MATANUSKA-SUSITNA            2
 6 AK    SITKA                        1
 7 AK    SKAGWAY MUNICIPALITY        88
 8 AL    AUTAUGA                    102
 9 AL    BALDWIN                    143
10 AL    BARBOUR                      1
# … with 2,921 more rowsThe variables are named State, County, and Employees. This is likely the data of railroad employees working at counties within the United States. There are 54 distinct “States” which include the District of Columbia and Canada, the other two are likely overseas or military.
# A tibble: 1 × 2
  state county
  <int>  <int>
1    54   1710# A tibble: 54 × 1
   state
   <chr>
 1 AE   
 2 AK   
 3 AL   
 4 AP   
 5 AR   
 6 AZ   
 7 CA   
 8 CO   
 9 CT   
10 DC   
# … with 44 more rowsSummarizing across both columns we can see there are 54 distinct “states” and 1710 “counties”, this means that there are multiple instances of counties throughout the United States, with some states having the same county names as others.
I was also able to determine the entire number of railroad employees across all stations. When grouping by state, we are able to determine how many employees worked at railroad stations by state when this data was collected. I have grouped the data by state, and made it so we can see the total, mean, median, max and min employees by state. We can see in Texas, the state with the largest number of railroad employees, has only 1 employee at the least populated station and 4,235 employees at the most poulated. The average amount of employees at a texas railroad station is about 90 people.
# A tibble: 1 × 2
  state county
  <int>  <int>
1    54   1710# A tibble: 1 × 1
  total_employees
            <dbl>
1          256094# A tibble: 54 × 5
   state total_state_employees mean_employees min_employees max_employees
   <chr>                 <dbl>          <dbl>         <dbl>         <dbl>
 1 TX                    19839           89.8             1          4235
 2 IL                    19131          186.              1          8207
 3 NY                    17050          280.              5          3685
 4 NE                    13176          148.              1          3797
 5 CA                    13137          239.              1          2888
 6 PA                    12769          196.              3          1649
 7 OH                     9056          103.              3           842
 8 GA                     8605           56.6             1           878
 9 IN                     8537           92.8             3          1999
10 MO                     8419           73.2             1          2055
# … with 44 more rowsHere I am attempting to read in and glean any information from this DataSet for Practice. Below I have Read in the data, and put it into a summary to better understand the data
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hotel [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| is_canceled [numeric] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| lead_time [numeric] | 
 | 479 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| arrival_date_year [numeric] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| arrival_date_month [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| arrival_date_week_number [numeric] | 
 | 53 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| arrival_date_day_of_month [numeric] | 
 | 31 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| stays_in_weekend_nights [numeric] | 
 | 17 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| stays_in_week_nights [numeric] | 
 | 35 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| adults [numeric] | 
 | 14 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| children [numeric] | 
 | 
 | 4 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| babies [numeric] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| meal [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| country [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| market_segment [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| distribution_channel [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| is_repeated_guest [numeric] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| previous_cancellations [numeric] | 
 | 15 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| previous_bookings_not_canceled [numeric] | 
 | 73 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| reserved_room_type [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| assigned_room_type [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| booking_changes [numeric] | 
 | 21 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| deposit_type [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| agent [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| company [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| days_in_waiting_list [numeric] | 
 | 128 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| customer_type [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| adr [numeric] | 
 | 8879 distinct values | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| required_car_parking_spaces [numeric] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| total_of_special_requests [numeric] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| reservation_status [character] | 
 | 
 | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| reservation_status_date [Date] | 
 | 926 distinct values | 0 (0.0%) | 
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-10-12
Initially after looking throught this data set I thought there was a lot of infomration, but because this is an earlier challenge I wanted to keep it preliminary here and work more with this data set in challenge 4 with the lubridate package
I was able to determine that there are only two hotels in this data set the City Hotel and the Resort Hotel. Below I was able to determine the average lead time by month for each hotel
# A tibble: 12 × 3
# Groups:   arrival_date_month [12]
   hotel        arrival_date_month avg_lead_time_by_month
   <chr>        <chr>                               <dbl>
 1 Resort Hotel July                                 96.2
 2 Resort Hotel August                               92.8
 3 Resort Hotel September                           146. 
 4 Resort Hotel October                             103. 
 5 Resort Hotel November                             32.3
 6 Resort Hotel December                             44.9
 7 Resort Hotel January                              19.7
 8 Resort Hotel February                             27.6
 9 Resort Hotel March                                52.4
10 Resort Hotel April                                73.6
11 Resort Hotel May                                 107. 
12 Resort Hotel June                                126. # A tibble: 12 × 3
# Groups:   arrival_date_month [12]
   hotel      arrival_date_month avg_lead_time_by_month
   <chr>      <chr>                               <dbl>
 1 City Hotel July                                133. 
 2 City Hotel August                              101. 
 3 City Hotel September                            81.8
 4 City Hotel October                              83.6
 5 City Hotel November                             53.7
 6 City Hotel December                             51.3
 7 City Hotel January                              32.4
 8 City Hotel February                             36.3
 9 City Hotel March                                53.4
10 City Hotel April                                75.9
11 City Hotel May                                  90.0
12 City Hotel June                                103. ---
title: "Challenge 2"
author: "Michaela Bowen"
desription: "Data wrangling: using group() and summarise()"
date: "10/11/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Data
-   railroad\*.csv or StateCounty2012.xls ⭐
-   hotel_bookings.csv ⭐⭐⭐⭐
:::panel-tabset
# RailRoads
Here I have read in the RailRoads Excel File Sheet. Extraneous and empty data has been removed, and Canada had its county named "Canada". 
```{r}
library(readxl)
railroads <-read_excel("_data/StateCounty2012.xls",
                     skip = 4,
                     col_names= c("state", "delete",  "county",
                                  "delete", "employees"))%>%
  select(!contains("delete"))%>%
  filter(!str_detect(state, "Total"))
railroads <-head(railroads, -2)%>%
  mutate(county = ifelse(state=="CANADA", "CANADA", county))
railroads
```
## Describe the data
The variables are named *State*, *County*, and *Employees*. This is likely the data of railroad employees working at counties within the United States. There are 54 distinct "States" which include the District of Columbia and Canada, the other two are likely overseas or military. 
```{r}
#| label: summary
railroads%>%
  summarise(across(c(state,county), n_distinct))
railroads%>%
  select(state)%>%
  unique()
```
Summarizing across both columns we can see there are 54 distinct "states" and 1710 "counties", this means that there are multiple instances of counties throughout the United States, with some states having the same county names as others.
I was also able to determine the entire number of railroad employees across all stations. When grouping by state, we are able to determine how many employees worked at railroad stations by state when this data was collected. I have grouped the data by state, and made it so we can see the total, mean, median, max and min employees by state. We can see in Texas, the state with the largest number of railroad employees, has only 1 employee at the least populated station and 4,235 employees at the most poulated. The average amount of employees at a texas railroad station is about 90 people. 
```{r}
#giving the number of distinct counties and states
railroads%>%
  summarise(across(c(state,county), n_distinct))
#getting the total employees
railroads%>%
  summarize(total_employees = sum(employees))
            
#sum, mean, min, and max employess for each state
railroads%>%
  group_by(state)%>%
  summarize(total_state_employees = sum(employees), mean_employees = mean(employees), min_employees = min(employees), max_employees= max(employees))%>%
  arrange(desc(total_state_employees))
```
# Hotel Bookings
Here I am attempting to read in and glean any information from this DataSet for Practice. Below I have Read in the data, and put it into a summary to better understand the data
```{r}
library(readxl)
library(lubridate)
hotel_bookings <- read_csv("_data/hotel_bookings.csv")
print(summarytools::dfSummary(hotel_bookings,
                        varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')
```
Initially after looking throught this data set I thought there was a lot of infomration, but because this is an earlier challenge I wanted to keep it preliminary here and work more with this data set in challenge 4 with the lubridate package
I was able to determine that there are only two hotels in this data set the City Hotel and the Resort Hotel.
Below I was able to determine the average lead time by month for each hotel
```{r}
#Resort Hotel Average lead time by month
hotel_bookings%>%
  group_by(arrival_date_month)%>%
  filter(is_canceled == 0, hotel == "Resort Hotel")%>%
  mutate(avg_lead_time_by_month = mean(lead_time))%>%
  select(hotel, arrival_date_month, avg_lead_time_by_month)%>%
  distinct()
#City Hotel Average lead time by month
hotel_bookings%>%
  group_by(arrival_date_month)%>%
  filter(is_canceled == 0, hotel == "City Hotel")%>%
  mutate(avg_lead_time_by_month = mean(lead_time))%>%
  select(hotel, arrival_date_month, avg_lead_time_by_month)%>%
  distinct()
```
:::