Challenge 2 Submission

challenge_2

railroads

faostat

hotel_bookings

Data wrangling: using group() and summarise()

Author

Xinpeng Liu

Published

May 30, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

railroad*.csv or StateCounty2012.xls ⭐
FAOstat*.csv or birds.csv ⭐⭐⭐
hotel_bookings.csv ⭐⭐⭐⭐

Code

data_railroad <- read_csv("_data/railroad_2012_clean_county.csv")

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

we choose - railroad_2012_clean_county.csv ⭐

Code

library(dplyr)
str(data_railroad)

spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state          : chr [1:2930] "AE" "AK" "AK" "AK" ...
 $ county         : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
 $ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   county = col_character(),
  ..   total_employees = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Code

summary(data_railroad$total_employees)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    7.00   21.00   87.18   65.00 8207.00

This data set contains information about the total number of employees in different counties across states. The dataset consists of three variables: ‘state’, ‘county’, and ‘total_employees’.

-‘state’ (chr): This variable is a character type that represents the abbreviation of each state in the United States. This variable does not have missing values and covers all 50 states plus other regions.

-‘county’ (chr): This variable is a character type that represents the names of the counties in each state. This variable does not have missing values, and there are multiple counties within each state.

-‘total_employees’ (int): This variable is an integer type that represents the total number of employees in each county. The values range from a minimum of 1 to a maximum of 8207, with a median of 21 and a mean of 87.18. The standard deviation is high, suggesting a wide spread in the number of employees across counties.

The data was likely gathered through employment records or surveys in each county and compiled for the year of interest. Each row represents an observation for a specific county in a specific state.

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code

data_grouped_by_state <- data_railroad %>%
  select(state,total_employees) %>%
  group_by(state) %>%
  summarize(
    total_employees_by_state = sum(total_employees, na.rm = TRUE),
    mean_number_of_employee_per_county = round(mean(total_employees, na.rm = TRUE), 2),
    median_number_of_employee_per_county = round(median(total_employees, na.rm = TRUE), 2),
    sd_employees_per_county = round(sd(total_employees, na.rm = TRUE), 2),
    IQR_employees_per_county = round(IQR(total_employees, na.rm = TRUE),2),
    min = min(total_employees, na.rm = TRUE),
    max = max(total_employees, na.rm = TRUE)
  )

# convert to standard data frame
data_grouped_by_state_df <- as.data.frame(data_grouped_by_state)

# print full data frame
print(data_grouped_by_state_df)

   state total_employees_by_state mean_number_of_employee_per_county
1     AE                        2                               2.00
2     AK                      103                              17.17
3     AL                     4257                              63.54
4     AP                        1                               1.00
5     AR                     3871                              53.76
6     AZ                     3153                             210.20
7     CA                    13137                             238.85
8     CO                     3650                              64.04
9     CT                     2592                             324.00
10    DC                      279                             279.00
11    DE                     1495                             498.33
12    FL                     7419                             110.73
13    GA                     8605                              56.61
14    HI                        4                               1.33
15    IA                     4019                              40.60
16    ID                     1563                              43.42
17    IL                    19131                             185.74
18    IN                     8537                              92.79
19    KS                     6092                              64.13
20    KY                     4811                              40.43
21    LA                     3915                              62.14
22    MA                     3379                             281.58
23    MD                     4709                             196.21
24    ME                      654                              40.88
25    MI                     3932                              50.41
26    MN                     5467                              63.57
27    MO                     8419                              73.21
28    MS                     2111                              27.06
29    MT                     3327                              62.77
30    NC                     3143                              33.44
31    ND                     2204                              44.98
32    NE                    13176                             148.04
33    NH                      393                              39.30
34    NJ                     8329                             396.62
35    NM                     1958                              67.52
36    NV                      746                              62.17
37    NY                    17050                             279.51
38    OH                     9056                             102.91
39    OK                     2318                              31.75
40    OR                     2322                              70.36
41    PA                    12769                             196.45
42    RI                      487                              97.40
43    SC                     2296                              49.91
44    SD                      949                              18.25
45    TN                     4952                              54.42
46    TX                    19839                              89.77
47    UT                     1917                              76.68
48    VA                     7551                              82.08
49    VT                      259                              18.50
50    WA                     5222                             133.90
51    WI                     3773                              54.68
52    WV                     3213                              60.62
53    WY                     2876                             130.73
   median_number_of_employee_per_county sd_employees_per_county
1                                   2.0                      NA
2                                   2.5                   34.76
3                                  26.0                  130.17
4                                   1.0                      NA
5                                  16.5                  131.14
6                                  94.0                  227.78
7                                  61.0                  549.47
8                                  10.0                  127.75
9                                 125.0                  520.20
10                                279.0                      NA
11                                158.0                  674.32
12                                 20.0                  386.01
13                                 15.0                  113.13
14                                  1.0                    0.58
15                                 14.0                   76.80
16                                 12.0                   95.55
17                                 42.0                  829.15
18                                 30.0                  233.06
19                                 12.0                  167.36
20                                 11.0                   76.91
21                                 20.0                  101.48
22                                271.0                  203.83
23                                107.5                  233.28
24                                 29.0                   38.12
25                                 13.0                  109.76
26                                 22.0                  122.39
27                                 24.0                  208.12
28                                 11.5                   46.69
29                                 11.0                  122.95
30                                 14.0                   58.59
31                                  8.0                   92.47
32                                 15.0                  511.58
33                                 15.5                   54.33
34                                296.0                  338.22
35                                 26.0                  112.72
36                                 19.0                   94.80
37                                 71.0                  590.78
38                                 41.0                  147.91
39                                 14.0                   55.86
40                                 30.0                  108.45
41                                 85.0                  293.07
42                                 48.0                  129.02
43                                 25.0                   53.91
44                                  5.0                   34.60
45                                 26.0                   94.82
46                                 17.0                  350.12
47                                 16.0                  142.57
48                                 25.5                  340.74
49                                  8.5                   24.54
50                                 29.0                  255.71
51                                 23.0                   82.17
52                                 33.0                   85.75
53                                 60.5                  168.98
   IQR_employees_per_county min  max
1                      0.00   2    2
2                      4.00   1   88
3                     47.00   1  990
4                      0.00   1    1
5                     33.75   1  972
6                    296.00   3  749
7                    188.00   1 2888
8                     39.00   1  553
9                    167.25  26 1561
10                     0.00 279  279
11                   606.50  62 1275
12                    57.50   1 3073
13                    42.50   1  878
14                     0.50   1    2
15                    28.50   1  609
16                    35.50   1  538
17                    84.00   1 8207
18                    55.00   3 1999
19                    47.50   1 1286
20                    33.50   1  483
21                    64.00   1  546
22                   295.00  44  673
23                   313.75   1  809
24                    60.25   2  117
25                    41.00   1  849
26                    43.75   1  651
27                    41.00   1 2055
28                    26.25   1  341
29                    41.00   1  525
30                    25.00   1  322
31                    32.00   1  407
32                    66.00   1 3797
33                    20.25   2  146
34                   474.00  19 1097
35                    49.00   2  431
36                    49.50   1  269
37                   169.00   5 3685
38                    97.25   3  842
39                    25.00   1  377
40                    77.00   2  467
41                   171.00   3 1649
42                    91.00   8  318
43                    56.25   1  220
44                     9.00   1  167
45                    49.00   1  621
46                    50.00   1 4235
47                    44.00   1  580
48                    33.25   1 3249
49                     8.00   3   83
50                    65.50   1 1039
51                    45.00   2  465
52                    68.00   1  406
53                   159.00   3  737

Code

data <- read.csv("_data/railroad_2012_clean_county.csv")
library(dplyr)
library(ggplot2)
# Calculate the mean
mean_total_employees <- mean(data$total_employees, na.rm = TRUE)
print(paste("The mean of total employees is", mean_total_employees))

[1] "The mean of total employees is 87.178156996587"

Code

# Calculate the median
median_total_employees <- median(data$total_employees, na.rm = TRUE)
print(paste("The median of total employees is", median_total_employees))

[1] "The median of total employees is 21"

Code

# Define a function to calculate the mode
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Calculate the mode
mode_total_employees <- getmode(data$total_employees)
print(paste("The mode of total employees is", mode_total_employees))

[1] "The mode of total employees is 1"

Code

library(dplyr)

# Calculate the standard deviation
sd_total_employees <- sd(data$total_employees, na.rm = TRUE)
print(paste("The standard deviation of total employees is", sd_total_employees))

[1] "The standard deviation of total employees is 283.635890179709"

Code

# Calculate the minimum
min_total_employees <- min(data$total_employees, na.rm = TRUE)
print(paste("The minimum of total employees is", min_total_employees))

[1] "The minimum of total employees is 1"

Code

# Calculate the maximum
max_total_employees <- max(data$total_employees, na.rm = TRUE)
print(paste("The maximum of total employees is", max_total_employees))

[1] "The maximum of total employees is 8207"

Code

# Calculate the 1st quartile
first_quartile <- quantile(data$total_employees, 0.25, na.rm = TRUE)
print(paste("The first quartile of total employees is", first_quartile))

[1] "The first quartile of total employees is 7"

Code

# Calculate the 3rd quartile
third_quartile <- quantile(data$total_employees, 0.75, na.rm = TRUE)
print(paste("The third quartile of total employees is", third_quartile))

[1] "The third quartile of total employees is 65"

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

Code

# Filter for Alabama (AL) and California (CA)
alabama_data <- data_railroad[data_railroad$state == "AL",]
california_data <- data_railroad[data_railroad$state == "CA",]
# Display basic statistics for Alabama
print(paste("The summary of alabama"))

[1] "The summary of alabama"

Code

summary(alabama_data)

    state              county          total_employees 
 Length:67          Length:67          Min.   :  1.00  
 Class :character   Class :character   1st Qu.: 10.50  
 Mode  :character   Mode  :character   Median : 26.00  
                                       Mean   : 63.54  
                                       3rd Qu.: 57.50  
                                       Max.   :990.00

Code

# Display basic statistics for California
print(paste("The summary of California"))

[1] "The summary of California"

Code

summary(california_data)

    state              county          total_employees 
 Length:55          Length:55          Min.   :   1.0  
 Class :character   Class :character   1st Qu.:  12.5  
 Mode  :character   Mode  :character   Median :  61.0  
                                       Mean   : 238.9  
                                       3rd Qu.: 200.5  
                                       Max.   :2888.0

Analyzing data from different states - Alabama and California - reveals insightful patterns about the distribution of railroad employees across counties. I chose these states to compare given their distinct geographical sizes, population densities, and overall distinct regional characteristics.

Alabama, known as “The Heart of Dixie,” has 4257 railroad employees. On average, there are about 63.54 employees in each county. This average might seem small at first, but considering the smaller size and lower population density of Alabama’s counties, it makes sense. Interestingly, the median value is 26, suggesting that the number of employees is less evenly distributed, and there are outliers with significantly more employees. The broad standard deviation of 130.17 supports this idea, revealing a wide spread of values around the mean.

On the other side of the country, California, the “Golden State,” has a total of 13137 railroad employees, which is significantly higher than Alabama’s. It’s no surprise as California is the most populous U.S. state with larger and more urbanized counties. The mean number of employees per county is approximately 238.85, showing a higher overall density of railroad employment. The median is 61, higher than in Alabama, demonstrating that even the counties with fewer employees in California have a larger workforce than most in Alabama. The considerable standard deviation of 549.47 and the wide IQR of 188 tell us about the great variability in California’s counties’ employment numbers.

The comparison between these two states brings to light how geographical and demographic factors may impact the distribution and concentration of employment in the railroad industry. Understanding these patterns could be a stepping stone for further research into factors influencing the job market and economic conditions in these states.