DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 2 Guanua Tan

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in the Data
  • Describe the data
  • Provide Grouped Summary Statistics
    • Explain and Interpret

Challenge 2 Guanua Tan

  • Show All Code
  • Hide All Code

  • View Source
challenge_2
railroads
faostat
hotel_bookings
Author

Guanhua Tan

Published

September 19, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

Code
library(dplyr)
railroad<-read_csv("_data/railroad_2012_clean_county.csv")
railroad
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows
Code
railroad %>%
  summary(railroad)
    state              county          total_employees  
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00  

The dataset of railroad across the United States includes three columns–state, county and total employees in each county – and 2930 counties.Each county at least maintains one employees while the maximum number is 8207. The average of total employees each county hires is 81.18. However, the median of it is 27.00, which reflects that most counties maintain a small station. The numbers of 1st Qu. and 3rd Qu. further confirms the finding.

Code
# top-ranking by county
max_railroad = railroad %>%
  arrange(desc(`total_employees`)) %>%
  slice(1:20)
max_railroad
# A tibble: 20 × 3
   state county           total_employees
   <chr> <chr>                      <dbl>
 1 IL    COOK                        8207
 2 TX    TARRANT                     4235
 3 NE    DOUGLAS                     3797
 4 NY    SUFFOLK                     3685
 5 VA    INDEPENDENT CITY            3249
 6 FL    DUVAL                       3073
 7 CA    SAN BERNARDINO              2888
 8 CA    LOS ANGELES                 2545
 9 TX    HARRIS                      2535
10 NE    LINCOLN                     2289
11 NY    NASSAU                      2076
12 MO    JACKSON                     2055
13 IN    LAKE                        1999
14 IL    WILL                        1784
15 PA    PHILADELPHIA                1649
16 NE    LANCASTER                   1619
17 CA    RIVERSIDE                   1567
18 CT    NEW HAVEN                   1561
19 NY    QUEENS                      1470
20 KS    JOHNSON                     1286

Cook county, IL, hires the largest size of employees and reaches to 8207 while Tarrant, TX, is second to it with 4235 employees. The city of Chicago, the third largest city of the United State in terms of population, is located in Cook county, which interprets why Cook hires a large group of employees to manage the operation of railroad. I also propose that residents in IL may be more willing to take train for commuting. Tarrant county hires a large number of employees because it serves the Dallas-Fort Worth-Arlington metropolitan area. Interestingly, New York City’s railroad is not in the top ten. I don’t think it means the city relies less on railroad. In fact, New York City is heavy reliance on railroad for public transportation. But several different authorities run railroads connecting between the city and the suburban area such as New Jersey and Connecticut. I have noted that New Haven, CT hires 1561 employees while Suffolk in Long Island, NY hires 3685. New Haven was the terminal of the Metro-North Railroad connecting between New York City and CT. Long Island Rail Road connects with New York City. (source: Wikipedia)

Code
# each state's percentage in the US
railroad %>% group_by(state) %>% summarise(total=sum(total_employees)) %>% arrange(desc(total)) %>% mutate(total_national=sum(total) , ratio=total/total_national)  %>% slice(1:10)
# A tibble: 10 × 4
   state total total_national  ratio
   <chr> <dbl>          <dbl>  <dbl>
 1 TX    19839         255432 0.0777
 2 IL    19131         255432 0.0749
 3 NY    17050         255432 0.0667
 4 NE    13176         255432 0.0516
 5 CA    13137         255432 0.0514
 6 PA    12769         255432 0.0500
 7 OH     9056         255432 0.0355
 8 GA     8605         255432 0.0337
 9 IN     8537         255432 0.0334
10 MO     8419         255432 0.0330

Although Cook county hires the largest size of employees, Illinois is not the NO.1 in terms of total employees. Texas outnumbers it by several hundreds employees.However, no state reaches 10 % of national total employees. Both Texas and Illinois are over 7% of national total employees. New York is even less 7% of them.

Code
# mean, median, and states of states
railroad %>% group_by(state) %>% summarise(total=sum(total_employees)) %>%  summarise(temployee_mean=mean(total),
            templyee_sd=sd(total, na.rm=TRUE),
            tempyee_median=median(total, na.rm=TRUE))
# A tibble: 1 × 3
  temployee_mean templyee_sd tempyee_median
           <dbl>       <dbl>          <dbl>
1          4819.       4782.           3379

The mean, median and standard deviation of states reflect that total_employees of some big states are overwhelming and some small states only maintain a tiny number of total employees.

Code
# state difference
railroad %>% group_by(state) %>%
  summarise(temployee_mean=mean(total_employees),
            templyee_sd=sd(total_employees, na.rm=TRUE),
            tempyee_median=median(total_employees, na.rm=TRUE))
# A tibble: 53 × 4
   state temployee_mean templyee_sd tempyee_median
   <chr>          <dbl>       <dbl>          <dbl>
 1 AE               2          NA              2  
 2 AK              17.2        34.8            2.5
 3 AL              63.5       130.            26  
 4 AP               1          NA              1  
 5 AR              53.8       131.            16.5
 6 AZ             210.        228.            94  
 7 CA             239.        549.            61  
 8 CO              64.0       128.            10  
 9 CT             324         520.           125  
10 DC             279          NA            279  
# … with 43 more rows

The big differences between mean, median and sd of each state reflect population distribution in he United State.

Code
#TX as a case study
railroad %>% filter(state=='TX') %>% arrange(desc(total_employees)) %>% mutate(total_TX=sum(total_employees), ratio_TX=total_employees/total_TX) %>% filter(total_employees >100)
# A tibble: 34 × 5
   state county     total_employees total_TX ratio_TX
   <chr> <chr>                <dbl>    <dbl>    <dbl>
 1 TX    TARRANT               4235    19839   0.213 
 2 TX    HARRIS                2535    19839   0.128 
 3 TX    BEXAR                  950    19839   0.0479
 4 TX    POTTER                 883    19839   0.0445
 5 TX    EL PASO                863    19839   0.0435
 6 TX    MONTGOMERY             474    19839   0.0239
 7 TX    JOHNSON                429    19839   0.0216
 8 TX    BELL                   413    19839   0.0208
 9 TX    DALLAS                 406    19839   0.0205
10 TX    DENTON                 394    19839   0.0199
# … with 24 more rows
Code
# TX as a case study
railroad %>% filter(state=='TX') %>% summarise(total_mean=mean(total_employees), total_madian=median(total_employees))
# A tibble: 1 × 2
  total_mean total_madian
       <dbl>        <dbl>
1       89.8           17

Cast study 1 TX Taking TX as case study, I try to looking the distribution in a state. The combination of TARRANT and HARRIS reaches 34 % of total employees in TX, which reflects high population density in these two areas.

Code
# case study IL
railroad %>% filter(state=='IL') %>% arrange(desc(total_employees)) %>% mutate(total_IL=sum(total_employees), ratio_IL=total_employees/total_IL) %>% filter(total_employees >100)
# A tibble: 25 × 5
   state county   total_employees total_IL ratio_IL
   <chr> <chr>              <dbl>    <dbl>    <dbl>
 1 IL    COOK                8207    19131   0.429 
 2 IL    WILL                1784    19131   0.0933
 3 IL    KNOX                 885    19131   0.0463
 4 IL    DUPAGE               837    19131   0.0438
 5 IL    KANE                 577    19131   0.0302
 6 IL    ST CLAIR             495    19131   0.0259
 7 IL    MADISON              427    19131   0.0223
 8 IL    MACON                425    19131   0.0222
 9 IL    LAKE                 340    19131   0.0178
10 IL    KANKAKEE             289    19131   0.0151
# … with 15 more rows

Case study of IL Cook county hire 42% employees of railroad in IL demonstrates the overwhelming population density in the city of Chicago as well as reflects the population distribution in IL.

Source Code
---
title: "Challenge 2 Guanua Tan"
author: "Guanhua Tan"
desription: "Data wrangling: using group() and summarise()"
date: "09/19/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - faostat
  - hotel_bookings
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to

1)  read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2)  provide summary statistics for different interesting groups within the data, and interpret those statistics

## Read in the Data

Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.

-   railroad\*.csv or StateCounty2012.xls ⭐
-   FAOstat\*.csv or birds.csv ⭐⭐⭐
-   hotel_bookings.csv ⭐⭐⭐⭐

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

## Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

## Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

### Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.


```{r}
library(dplyr)
railroad<-read_csv("_data/railroad_2012_clean_county.csv")
railroad
```



```{r}
#| label: summary
railroad %>%
  summary(railroad)
```

The dataset of railroad across the United States includes three columns--state, county and total employees in each county -- and 2930 counties.Each county at least maintains one employees while the maximum number is 8207. The average of total employees each county hires is 81.18. However, the median of it is 27.00, which reflects that most counties maintain a small station. The numbers of 1st Qu. and 3rd Qu. further confirms the finding.




```{r}
# top-ranking by county
max_railroad = railroad %>%
  arrange(desc(`total_employees`)) %>%
  slice(1:20)
max_railroad
```
Cook county, IL, hires the largest size of employees and reaches to 8207 while Tarrant, TX, is second to it with 4235 employees. The city of Chicago, the third largest city of the United State in terms of population, is located in Cook county, which interprets why Cook hires a large group of employees to manage the operation of railroad. I also propose that residents in IL may be more willing to take train for commuting. Tarrant county hires a large number of employees because it serves the Dallas-Fort Worth-Arlington metropolitan area. Interestingly, New York City's railroad is not in the top ten. I don't think it means the city relies less on railroad. In fact, New York City is heavy reliance on railroad for public transportation. But several different authorities run railroads connecting between the city and the suburban area such as New Jersey and Connecticut. I have noted that New Haven, CT hires 1561 employees while Suffolk in Long Island, NY hires 3685. New Haven was the terminal of the Metro-North Railroad connecting between New York City and CT. Long Island Rail Road connects with New York City. (source: Wikipedia)


```{r}
# each state's percentage in the US
railroad %>% group_by(state) %>% summarise(total=sum(total_employees)) %>% arrange(desc(total)) %>% mutate(total_national=sum(total) , ratio=total/total_national)  %>% slice(1:10)
```

Although Cook county hires the largest size of employees, Illinois is not the NO.1 in terms of total employees. Texas outnumbers it by several hundreds employees.However, no state reaches 10 % of national total employees. Both Texas and Illinois are over 7% of national total employees. New York is even less 7% of them.

```{r}
# mean, median, and states of states
railroad %>% group_by(state) %>% summarise(total=sum(total_employees)) %>%  summarise(temployee_mean=mean(total),
            templyee_sd=sd(total, na.rm=TRUE),
            tempyee_median=median(total, na.rm=TRUE))
```
The mean, median and standard deviation of states reflect that total_employees of some big states are overwhelming and some small states only maintain a tiny number of total employees.

```{r}
# state difference
railroad %>% group_by(state) %>%
  summarise(temployee_mean=mean(total_employees),
            templyee_sd=sd(total_employees, na.rm=TRUE),
            tempyee_median=median(total_employees, na.rm=TRUE))
```
  The big differences between mean, median and sd of each state reflect population distribution in he United State.


```{r}
#TX as a case study
railroad %>% filter(state=='TX') %>% arrange(desc(total_employees)) %>% mutate(total_TX=sum(total_employees), ratio_TX=total_employees/total_TX) %>% filter(total_employees >100)
```

```{r}
# TX as a case study
railroad %>% filter(state=='TX') %>% summarise(total_mean=mean(total_employees), total_madian=median(total_employees))
```
  Cast study 1 TX
  Taking TX as case study, I try to looking the distribution in a state. The combination of TARRANT and HARRIS reaches 34 % of total employees in TX, which reflects high population density in these two areas.
  
```{r}
# case study IL
railroad %>% filter(state=='IL') %>% arrange(desc(total_employees)) %>% mutate(total_IL=sum(total_employees), ratio_IL=total_employees/total_IL) %>% filter(total_employees >100)
```
  Case study of IL
  Cook county hire 42% employees of railroad in IL demonstrates the overwhelming population density in the city of Chicago as well as reflects the population distribution in IL.