library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Poobigan Murugesan
May 9, 2023
Today’s challenge is to
1) read in a dataset, and
2) describe the dataset using both words and any supporting information (e.g., tables)
Read in one (or more) of the following data sets, using the correct R package and command.
Find the _data folder, located inside the posts folder. Then you can read in the data, using either one of the readr standard tidy read commands, or a specialized package such as readxl.
Loading the data and printing the top few rows
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
Description of the railroad dataset
spc_tbl_ [2,930 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ state : chr [1:2930] "AE" "AK" "AK" "AK" ...
$ county : chr [1:2930] "APO" "ANCHORAGE" "FAIRBANKS NORTH STAR" "JUNEAU" ...
$ total_employees: num [1:2930] 2 7 2 3 2 1 88 102 143 1 ...
- attr(*, "spec")=
.. cols(
.. state = col_character(),
.. county = col_character(),
.. total_employees = col_double()
.. )
- attr(*, "problems")=<externalptr>
Dimensions of the dataset (rows, columns) and names of columns
The output above shows that the “railroad_2012_clean_county” dataset records the number of railroad employees in each county, by state, for 2012. It contains 2,930 rows, each representing a county within a state, and three columns: state, county, and total_employees.
Summarizing the data
state county total_employees
Length:2930 Length:2930 Min. : 1.00
Class :character Class :character 1st Qu.: 7.00
Mode :character Mode :character Median : 21.00
Mean : 87.18
3rd Qu.: 65.00
Max. :8207.00
Sorting data based on the total number of employees in increasing order
# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 AK SITKA 1
2 AL BARBOUR 1
3 AL HENRY 1
4 AP APO 1
5 AR NEWTON 1
6 CA MONO 1
7 CO BENT 1
8 CO CHEYENNE 1
9 CO COSTILLA 1
10 CO DOLORES 1
# ℹ 2,920 more rows
Sorting data based on the total number of employees in decreasing order
# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 IL COOK 8207
2 TX TARRANT 4235
3 NE DOUGLAS 3797
4 NY SUFFOLK 3685
5 VA INDEPENDENT CITY 3249
6 FL DUVAL 3073
7 CA SAN BERNARDINO 2888
8 CA LOS ANGELES 2545
9 TX HARRIS 2535
10 NE LINCOLN 2289
# ℹ 2,920 more rows
These outputs show that the county ‘COOK’ in IL has the most railroad employees (8,207). At the other extreme, several counties have only one employee, which is the minimum in the dataset.
Grouping railroad employees by state
# A tibble: 53 × 2
state employees
<chr> <dbl>
1 TX 19839
2 IL 19131
3 NY 17050
4 NE 13176
5 CA 13137
6 PA 12769
7 OH 9056
8 GA 8605
9 IN 8537
10 MO 8419
# ℹ 43 more rows
The results above show that Texas has the most railroad employees (19,839), followed by Illinois (19,131) and New York (17,050). The grouped table has 53 rows, so the state column holds 53 distinct codes. That is more than the 50 U.S. states because the column also contains non-state codes, such as the military postal codes AE and AP that appear in the APO rows.
---
title: "Challenge 1"
author: "Poobigan Murugesan"
description: "Reading in data and creating a post"
date: "05/09/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_1
  - railroads
  - poobigan murugesan
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a dataset, and
2) describe the dataset using both words and any supporting information (e.g., tables)
## Read in the Data
Read in one (or more) of the following data sets, using the correct R package and command.
- railroad_2012_clean_county.csv ⭐
- birds.csv ⭐⭐
- FAOstat\*.csv ⭐⭐
- wild_bird_data.xlsx ⭐⭐⭐
- StateCounty2012.xls ⭐⭐⭐⭐
Find the `_data` folder, located inside the `posts` folder. Then you can read in the data, using either one of the `readr` standard tidy read commands, or a specialized package such as `readxl`.
Loading the data and printing the top few rows
```{r}
df <- read_csv("_data/railroad_2012_clean_county.csv")
head(df)
```
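As a hedged illustration of pinning the column types at read time, the sketch below parses a small inline sample rather than the real file. The four rows are copied from the `str()` output in this post. `sample_csv` and `railroad_sample` are illustrative names, and the `col_types` spec simply mirrors the types readr guessed. It assumes readr 2.0 or later, where `I()` marks a string as literal data.

```r
library(readr)

# Illustrative inline sample (assumption: stands in for the real CSV file).
# The rows are the first four shown in the str() output of this post.
sample_csv <- I("state,county,total_employees
AE,APO,2
AK,ANCHORAGE,7
AK,FAIRBANKS NORTH STAR,2
AK,JUNEAU,3
")

# Declare the column types explicitly instead of letting readr guess them.
railroad_sample <- read_csv(
  sample_csv,
  col_types = cols(
    state = col_character(),
    county = col_character(),
    total_employees = col_double()
  )
)

railroad_sample
```

Declaring the types up front also silences the column-specification message that `read_csv()` otherwise prints.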
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
Description of the railroad dataset
```{r}
str(df)
```
Dimensions of the dataset (rows, columns) and names of columns
```{r}
dim(df)
colnames(df)
```
The output above shows that the "railroad_2012_clean_county" dataset records the number of railroad employees in each county, by state, for 2012. It contains 2,930 rows, each representing a county within a state, and three columns: state, county, and total_employees.
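One way to check what a row represents is to count rows per state and county combination. The sketch below does this on a small hand-built sample whose rows all appear in the outputs of this post. `railroad_sample` and `duplicates` are illustrative names, and the same pipeline could be pointed at `df` instead.

```r
library(dplyr)

# Illustrative sample (assumption: a stand-in for the full data frame).
# Note that the county name APO occurs under two different state codes.
railroad_sample <- tibble(
  state = c("AE", "AP", "AK", "AK"),
  county = c("APO", "APO", "ANCHORAGE", "JUNEAU"),
  total_employees = c(2, 1, 7, 3)
)

# Any state-county pair that occurs more than once would show up here.
duplicates <- railroad_sample %>%
  count(state, county) %>%
  filter(n > 1)

nrow(duplicates)                    # rows that are duplicated pairs
n_distinct(railroad_sample$county)  # county names alone can repeat
```

In this sample no state-county pair repeats, while the county name on its own does, which suggests treating the pair (not the county name alone) as the identifier of a case.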
Summarizing the data
```{r}
summary(df)
```
Sorting data based on the total number of employees in increasing order
```{r}
arrange(df, total_employees)
```
Sorting data based on the total number of employees in decreasing order
```{r}
arrange(df, desc(total_employees))
```
These outputs show that the county 'COOK' in IL has the most railroad employees (8,207). At the other extreme, several counties have only one employee, which is the minimum in the dataset.
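The same two facts can be extracted directly instead of being read off the sorted tables. This is a minimal sketch on a hand-built sample taken from the rows printed above. `railroad_sample`, `largest`, and `n_at_min` are illustrative names, and `slice_max()` assumes dplyr 1.0 or later.

```r
library(dplyr)

# Illustrative sample (assumption: a few rows copied from the sorted outputs).
railroad_sample <- tibble(
  state = c("IL", "TX", "NE", "AK", "AL", "AL"),
  county = c("COOK", "TARRANT", "DOUGLAS", "SITKA", "BARBOUR", "HENRY"),
  total_employees = c(8207, 4235, 3797, 1, 1, 1)
)

# The single row with the largest employee count.
largest <- railroad_sample %>%
  slice_max(total_employees, n = 1)

# How many rows share the minimum employee count.
n_at_min <- railroad_sample %>%
  filter(total_employees == min(total_employees)) %>%
  nrow()

largest
n_at_min
```

Run against `df`, the second expression would give the exact number of counties with a single employee, which the sorted printout only hints at.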
Grouping railroad employees by state
```{r}
group <- df %>%
  group_by(state) %>%
  summarize(employees = sum(total_employees)) %>%
  arrange(desc(employees))
group
```
The results above show that Texas has the most railroad employees (19,839), followed by Illinois (19,131) and New York (17,050). The grouped table has 53 rows, so the state column holds 53 distinct codes. That is more than the 50 U.S. states because the column also contains non-state codes, such as the military postal codes AE and AP that appear in the APO rows.
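The count of 53 can be compared against the 50 state abbreviations that ship with base R in `state.abb`. The sketch below uses a short hand-picked vector of codes, all of which appear in this post's outputs. `state_codes` and `non_states` are illustrative names, and on the real data the input would be `unique(df$state)`.

```r
# Illustrative sample of codes (assumption: stands in for unique(df$state)).
state_codes <- c("AE", "AP", "AK", "IL", "TX", "NY")

# state.abb is a built-in vector of the 50 U.S. state abbreviations.
# Codes left over after removing them are not states.
non_states <- setdiff(state_codes, state.abb)

non_states
length(state.abb)
```

For this sample the leftover codes are AE and AP, the codes attached to the APO rows in the data.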