library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Shreya Varma
June 17, 2023
Today’s challenge is to
1) read in a dataset, and
2) describe the dataset using both words and any supporting information (e.g., tables)
Solution: Because the file is a CSV, I use the read_csv() function to read the data, then view the first 10 rows to check that the data and headers were imported correctly.
# A tibble: 10 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
Solution: This dataset appears to record the number of railroad employees in the US at the state and county level, with 2,930 rows. Cook County, Illinois has the most employees (8,207), and several counties have only 1 employee, the minimum. The mean number of employees per county is about 87, while the median is 21, so a few large counties pull the mean up. County names repeat across states (for example, APO appears under both AE and AP), so a county name alone is not unique and the state is needed to identify a row. There are 255,432 employees in total across 53 distinct state codes; this exceeds the 50 US states because the column also contains non-state codes such as the military postal codes AE and AP. No column has missing values.
cols(
state = col_character(),
county = col_character(),
total_employees = col_double()
)
state county total_employees
Length:2930 Length:2930 Min. : 1.00
Class :character Class :character 1st Qu.: 7.00
Mode :character Mode :character Median : 21.00
Mean : 87.18
3rd Qu.: 65.00
Max. :8207.00
[1] 53
[1] 255432
state county total_employees
0 0 0
[1] TRUE
# A tibble: 1,221 × 3
state county total_employees
<chr> <chr> <dbl>
1 AP APO 1
2 AR CALHOUN 5
3 AR CLAY 13
4 AR CLEBURNE 8
5 AR DALLAS 12
6 AR FRANKLIN 5
7 AR GREENE 15
8 AR JACKSON 13
9 AR JEFFERSON 361
10 AR LAWRENCE 32
# ℹ 1,211 more rows
# A tibble: 1,221 × 3
state county total_employees
<chr> <chr> <dbl>
1 AP APO 1
2 AR CALHOUN 5
3 AR CLAY 13
4 AR CLEBURNE 8
5 AR DALLAS 12
6 AR FRANKLIN 5
7 AR GREENE 15
8 AR JACKSON 13
9 AR JEFFERSON 361
10 AR LAWRENCE 32
# ℹ 1,211 more rows
# A tibble: 2 × 3
state county total_employees
<chr> <chr> <dbl>
1 AE APO 2
2 AP APO 1
# A tibble: 2,930 × 3
state county total_employees
<chr> <chr> <dbl>
1 IL COOK 8207
2 TX TARRANT 4235
3 NE DOUGLAS 3797
4 NY SUFFOLK 3685
5 VA INDEPENDENT CITY 3249
6 FL DUVAL 3073
7 CA SAN BERNARDINO 2888
8 CA LOS ANGELES 2545
9 TX HARRIS 2535
10 NE LINCOLN 2289
# ℹ 2,920 more rows
---
title: "Challenge 1 Solution"
author: "Shreya Varma"
description: "Reading in data and creating a post"
date: "6/17/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_1
- railroads
- faostat
- wildbirds
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a dataset, and
2) describe the dataset using both words and any supporting information (e.g., tables)
## Read in the Data
Solution: Because the file is a CSV, I use the read_csv() function to read the data, then view the first 10 rows to check that the data and headers were imported correctly.
```{r}
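# Read the county-level railroad employment file
railroad_data <- read_csv("_data/railroad_2012_clean_county.csv")
# Preview the first 10 rows to confirm the headers and values
head(railroad_data, 10)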
```
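As a hedged aside, the import can also be made stricter by declaring the column types up front instead of letting `read_csv()` guess them. The sketch below is an illustration under assumptions: it reads a small inline CSV string (wrapped in `I()`, which needs readr 2.0 or later) in place of the real file, and the name `toy` is hypothetical. The column specification mirrors what `spec()` reports for this dataset.

```r
library(readr)

# Inline CSV text standing in for the real file (assumption: illustrative rows only).
# I() tells read_csv() to treat the string as literal data rather than a path.
csv_text <- I("state,county,total_employees\nAE,APO,2\nAK,ANCHORAGE,7\n")

# Declare the expected type of every column explicitly
toy <- read_csv(
  csv_text,
  col_types = cols(
    state = col_character(),
    county = col_character(),
    total_employees = col_double()
  )
)

# problems() lists any values that failed to parse; zero rows means a clean import
stopifnot(nrow(problems(toy)) == 0)
```

Declaring the types this way means a malformed value surfaces as a parsing problem at import time instead of silently changing a column's type.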
## Describe the data
Solution: This dataset appears to record the number of railroad employees in the US at the state and county level, with 2,930 rows. Cook County, Illinois has the most employees (8,207), and several counties have only 1 employee, the minimum. The mean number of employees per county is about 87, while the median is 21, so a few large counties pull the mean up. County names repeat across states (for example, APO appears under both AE and AP), so a county name alone is not unique and the state is needed to identify a row. There are 255,432 employees in total across 53 distinct state codes; this exceeds the 50 US states because the column also contains non-state codes such as the military postal codes AE and AP. No column has missing values.
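The statements about the largest county and the counties at the minimum can be checked directly rather than read off a sorted table. This is a minimal base-R sketch, assuming a toy data frame (`toy`, a hypothetical name) built from five rows that appear in this dataset; the same expressions should carry over to `railroad_data`.

```r
# Toy stand-in for railroad_data, using five rows that appear in the real file
toy <- data.frame(
  state = c("IL", "AK", "AL", "AE", "AP"),
  county = c("COOK", "SITKA", "BARBOUR", "APO", "APO"),
  total_employees = c(8207, 1, 1, 2, 1)
)

# Smallest employee count and how many rows sit at that minimum
min_emp <- min(toy$total_employees)
n_at_min <- sum(toy$total_employees == min_emp)

# Row with the largest employee count
top <- toy[which.max(toy$total_employees), ]

# Mean versus median: a large gap signals a right-skewed distribution
mean_emp <- mean(toy$total_employees)
median_emp <- median(toy$total_employees)
```

Comparing `mean_emp` with `median_emp` is a quick way to see whether a handful of large counties dominate the average.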
```{r}
#| label: summary
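# Column specification that read_csv() inferred on import
spec(railroad_data)
# Summary statistics for each column
summary(railroad_data)
# Number of distinct state codes
n_distinct(railroad_data$state)
# Total employees across all rows
sum(railroad_data$total_employees)
# Missing values per column
colSums(is.na(railroad_data))
# Does any county name occur more than once?
any(duplicated(railroad_data$county))
# Rows whose county name already appeared earlier (two equivalent forms)
subset(railroad_data, duplicated(county))
railroad_data[duplicated(railroad_data$county), ]
# Example of a repeated name: APO appears under two state codes
railroad_data[railroad_data$county == "APO", ]
# Counties ordered from most to fewest employees
railroad_data %>% arrange(desc(total_employees))
```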
```
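Two follow-up checks may help here, offered as a sketch, not a definitive method. First, `duplicated()` flags only the second and later occurrences of a value, so it undercounts the rows that share a county name. Second, R's built-in `state.abb` vector (the 50 state abbreviations) can show which codes in the `state` column are not states. The data frame `toy` below is a hypothetical stand-in built from rows that appear in this dataset.

```r
# Toy stand-in for railroad_data, using five rows that appear in the real file
toy <- data.frame(
  state = c("AE", "AP", "AK", "AL", "IL"),
  county = c("APO", "APO", "ANCHORAGE", "AUTAUGA", "COOK"),
  total_employees = c(2, 1, 7, 102, 8207)
)

# Codes in the state column that are not among the 50 US state abbreviations
non_states <- setdiff(unique(toy$state), state.abb)

# All rows that share a county name, including each name's first occurrence
repeated_names <- toy$county[duplicated(toy$county)]
shared <- toy[toy$county %in% repeated_names, ]

# The state + county pair should identify each row uniquely
key_is_unique <- !any(duplicated(toy[, c("state", "county")]))
```

On the toy rows, `non_states` holds the two military postal codes and `shared` holds both APO rows, whereas `duplicated()` alone would flag only the second one.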