Challenge 2 Instructions

challenge_2
railroads
faostat
hotel_bookings
Author

Aditya Salveru

Published

April 22, 2023

# 
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐

We read in the railroad data (railroad_2012_clean_county.csv) with read.csv() and preview the first rows:

railroad <- read.csv("_data/railroad_2012_clean_county.csv")
head(railroad)
  state               county total_employees
1    AE                  APO               2
2    AK            ANCHORAGE               7
3    AK FAIRBANKS NORTH STAR               2
4    AK               JUNEAU               3
5    AK    MATANUSKA-SUSITNA               2
6    AK                SITKA               1

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

dim(railroad)
[1] 2930    3
str(railroad)
'data.frame':   2930 obs. of  3 variables:
 $ state          : chr  "AE" "AK" "AK" "AK" ...
 $ county         : chr  "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
 $ total_employees: int  2 7 2 3 2 1 88 102 143 1 ...

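The dataset contains 2,930 rows and three columns: state, county, and total_employees.

Each row is one case: a county (identified by its state code and county name) together with its total number of railroad employees. Judging by the file name, the counts refer to 2012.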
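A few quick checks can support this description. The sketch below is only illustrative: it rebuilds the six rows printed by head(railroad) as a small data frame (the name railroad_sample is hypothetical) so that it runs without the CSV file. On the full data, the same calls would be applied to railroad.

```r
library(dplyr)

# Illustrative sample: the six rows printed by head(railroad) above
railroad_sample <- data.frame(
  state = c("AE", "AK", "AK", "AK", "AK", "AK"),
  county = c("APO", "ANCHORAGE", "FAIRBANKS NORTH STAR",
             "JUNEAU", "MATANUSKA-SUSITNA", "SITKA"),
  total_employees = c(2L, 7L, 2L, 3L, 2L, 1L)
)

# Number of distinct state codes in the sample
n_states <- n_distinct(railroad_sample$state)

# Missing values per column
na_counts <- colSums(is.na(railroad_sample))

# State/county pairs that appear more than once (none expected
# if each row is a unique county)
duplicate_pairs <- railroad_sample %>%
  count(state, county) %>%
  filter(n > 1)

n_states
na_counts
duplicate_pairs
```

If duplicate_pairs comes back empty on the full data, that would support reading each row as a unique state/county case.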

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.

Total employees per state:

state_wise = select(railroad, state, total_employees)
state_wise %>% 
  group_by(state) %>%
  summarize(totalEmployees=sum(total_employees))
# A tibble: 53 × 2
   state totalEmployees
   <chr>          <int>
 1 AE                 2
 2 AK               103
 3 AL              4257
 4 AP                 1
 5 AR              3871
 6 AZ              3153
 7 CA             13137
 8 CO              3650
 9 CT              2592
10 DC               279
# ℹ 43 more rows

Mean, median, and standard deviation of employees per county in every state:

state_wise = select(railroad, state, total_employees)
state_wise %>% 
  group_by(state) %>%
  summarize(
    meanEmployees = mean(total_employees),
    medianEmployees = median(total_employees),
    standardDeviation = sd(total_employees)
  )
# A tibble: 53 × 4
   state meanEmployees medianEmployees standardDeviation
   <chr>         <dbl>           <dbl>             <dbl>
 1 AE              2               2                NA  
 2 AK             17.2             2.5              34.8
 3 AL             63.5            26               130. 
 4 AP              1               1                NA  
 5 AR             53.8            16.5             131. 
 6 AZ            210.             94               228. 
 7 CA            239.             61               549. 
 8 CO             64.0            10               128. 
 9 CT            324             125               520. 
10 DC            279             279                NA  
# ℹ 43 more rows
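The prompt also asks for the mode and for min/max/quantile, which the summary above does not cover. A minimal sketch of how they could be added follows. It uses only the six rows shown by head(railroad), so its numbers describe that sample and not the full data. stat_mode is a hypothetical helper, since base R has no built-in statistical mode function.

```r
library(dplyr)

# Illustrative sample: state and total_employees from head(railroad)
sample_rows <- data.frame(
  state = c("AE", rep("AK", 5)),
  total_employees = c(2L, 7L, 2L, 3L, 2L, 1L)
)

# Hypothetical helper: most frequent value (smallest value on ties,
# because table() orders its entries by value)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

dispersion <- sample_rows %>%
  group_by(state) %>%
  summarize(
    counties = n(),
    minEmployees = min(total_employees),
    q25 = quantile(total_employees, 0.25),
    q75 = quantile(total_employees, 0.75),
    maxEmployees = max(total_employees),
    modeEmployees = stat_mode(total_employees)
  )

dispersion
```

Including n() alongside the other statistics makes it easy to see which groups rest on a single county.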

State-wise employee totals in descending order:

state_wise_grouped <- state_wise %>% 
  group_by(state) %>%
  summarize(Sum = sum(total_employees))

sorted <- state_wise_grouped %>%
  arrange(desc(Sum))

sorted
# A tibble: 53 × 2
   state   Sum
   <chr> <int>
 1 TX    19839
 2 IL    19131
 3 NY    17050
 4 NE    13176
 5 CA    13137
 6 PA    12769
 7 OH     9056
 8 GA     8605
 9 IN     8537
10 MO     8419
# ℹ 43 more rows
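When only the top few groups are of interest, dplyr's slice_max() can replace the arrange(desc()) step. The sketch below is an assumption-labelled illustration: the data frame state_totals is hypothetical and holds five of the state totals printed above, entered out of order on purpose.

```r
library(dplyr)

# Hypothetical input: five state totals copied from the output above
state_totals <- data.frame(
  state = c("CA", "TX", "NE", "IL", "NY"),
  Sum = c(13137L, 19839L, 13176L, 19131L, 17050L)
)

# Keep the two states with the largest totals, largest first
top2 <- state_totals %>%
  slice_max(Sum, n = 2)

top2
```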

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

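We grouped total_employees by state: for each state we computed the total, mean, median, and standard deviation of employees across its counties. State is the natural grouping here because it is the only variable that sits above the county level in this data set.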

A few state codes (for example AE, AP, and DC) have only one county-level record, so their standard deviation is reported as NA: the sample standard deviation is undefined for a single observation, which is not the same as 0. After sorting by total employees, TX and IL have the most railroad employees, whereas AP has only 1.
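The NA values can be reproduced directly. The sketch below again uses only the six rows from head(railroad), so the AK figure differs from the full-data value reported earlier. It shows that sd() returns NA for a single observation, because the n - 1 denominator is zero, and that counting rows per group identifies the affected groups.

```r
library(dplyr)

# Illustrative sample: state and total_employees from head(railroad)
sample_rows <- data.frame(
  state = c("AE", rep("AK", 5)),
  total_employees = c(2L, 7L, 2L, 3L, 2L, 1L)
)

# sd() of one value is NA, not 0
single_sd <- sd(2)

# Count counties per state next to the standard deviation
spread <- sample_rows %>%
  group_by(state) %>%
  summarize(
    counties = n(),
    standardDeviation = sd(total_employees)
  )

# Groups with a single county, where sd is undefined
single_county <- filter(spread, counties == 1)

single_sd
spread
single_county
```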