library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Jack Sniezek
November 28, 2022
Today’s challenge is to
1) read in a dataset, and
2) describe the dataset using both words and any supporting information (e.g., tables)
# A tibble: 30,977 × 14
Domain Cod…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1961 1961
2 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1962 1962
3 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1963 1963
4 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1964 1964
5 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1965 1965
6 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1966 1966
7 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1967 1967
8 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1968 1968
9 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1969 1969
10 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1970 1970
# … with 30,967 more rows, 4 more variables: Unit <chr>, Value <dbl>,
# Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
# ¹`Domain Code`, ²`Area Code`, ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
Domain Code         Domain              Area Code           Area
Length:30977        Length:30977        Min.   :   1        Length:30977
Class :character    Class :character    1st Qu.:  79        Class :character
Mode  :character    Mode  :character    Median : 156        Mode  :character
                                        Mean   :1202
                                        3rd Qu.: 231
                                        Max.   :5504
Element Code        Element             Item Code           Item
Min.   :5112        Length:30977        Min.   :1057        Length:30977
1st Qu.:5112        Class :character    1st Qu.:1057        Class :character
Median :5112        Mode  :character    Median :1068        Mode  :character
Mean   :5112                            Mean   :1066
3rd Qu.:5112                            3rd Qu.:1072
Max.   :5112                            Max.   :1083
Year Code           Year                Unit                Value
Min.   :1961        Min.   :1961        Length:30977        Min.   :       0
1st Qu.:1976        1st Qu.:1976        Class :character    1st Qu.:     171
Median :1992        Median :1992        Mode  :character    Median :    1800
Mean   :1991        Mean   :1991                            Mean   :   99411
3rd Qu.:2005        3rd Qu.:2005                            3rd Qu.:   15404
Max.   :2018        Max.   :2018                            Max.   :23707134
                                                            NA's   :1036
Flag                Flag Description
Length:30977        Length:30977
Class :character    Class :character
Mode  :character    Mode  :character
At first glance, I can see that the birds dataset has 14 variables: 8 character and 6 numeric. The summary statistics suggested a single Element, since Element Code has the same minimum and maximum (5112), and listing the unique values confirmed that the only Element is “Stocks”. Checking Domain and Flag Description for unique values showed me that there is one Domain, “Live Animals”, and six unique flag descriptions describing how the data was obtained. Observations span 1961 to 2018.
# A tibble: 1 × 1
Element
<chr>
1 Stocks
# A tibble: 1 × 1
Domain
<chr>
1 Live Animals
# A tibble: 6 × 1
`Flag Description`
<chr>
1 FAO estimate
2 Official data
3 FAO data based on imputation methodology
4 Data not available
5 Unofficial figure
6 Aggregate, may include official, semi-official, estimated or calculated data
Checking for unique Items and Areas showed me that there are five unique Items (Chickens, Ducks, Geese and guinea fowls, Turkeys, and Pigeons/other birds) and a total of 248 Areas.
# A tibble: 248 × 2
`Area Code` Area
<dbl> <chr>
1 2 Afghanistan
2 3 Albania
3 4 Algeria
4 5 American Samoa
5 7 Angola
6 8 Antigua and Barbuda
7 9 Argentina
8 1 Armenia
9 22 Aruba
10 10 Australia
# … with 238 more rows
# A tibble: 5 × 2
`Item Code` Item
<dbl> <chr>
1 1057 Chickens
2 1068 Ducks
3 1072 Geese and guinea fowls
4 1079 Turkeys
5 1083 Pigeons, other birds
Every observation uses the same Unit, “1000 Head”, so each Value is a count in thousands of birds.
# A tibble: 1 × 1
Unit
<chr>
1 1000 Head
My initial thought is that the relevant variables in this dataset are Item, Area, Area Code, Value, and Year. Another noteworthy observation is that Value jumps from 15,404 at the 3rd quartile to more than 23 million at the maximum, and the mean (99,411) sits far above the median (1,800). This makes me think there could be outliers in play.
# A tibble: 30,977 × 1
Area
<chr>
1 World
2 World
3 World
4 World
5 World
6 World
7 World
8 World
9 World
10 World
# … with 30,967 more rows
Arranging the Areas by descending Value shows that the largest values belong to World, an aggregate total rather than a single area. Mixing such totals with individual areas would severely distort any summary of the data, but that is a problem for another day.
---
title: "Challenge 1"
author: "Jack Sniezek"
description: "Reading in data and creating a post"
date: "11/28/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_1
  - railroads
  - faostat
  - wildbirds
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a dataset, and
2) describe the dataset using both words and any supporting information (e.g., tables)
## Read in the Data
- birds.csv ⭐⭐
```{r}
birds <- read_csv("_data/birds.csv")
birds
summary(birds)
```
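
Before describing the data, it can help to tally the column types rather than count them by eye in the printout. The chunk below is a minimal sketch of that idea on a small made-up tibble called `toy` (a hypothetical stand-in, not the real data); swapping `toy` for `birds` should give the character/numeric split for the full dataset.

```{r}
library(dplyr)
library(tibble)

# Hypothetical stand-in for `birds`; the values are made up for illustration
toy <- tibble(
  Area        = c("Afghanistan", "Albania"),
  `Area Code` = c(2, 3),
  Item        = c("Chickens", "Ducks"),
  Value       = c(10, NA)
)

# One row per column with its first class, then count columns per class
type_tally <- tibble(
  column = names(toy),
  type   = unname(vapply(toy, function(x) class(x)[1], character(1)))
) %>%
  count(type, name = "n_columns")

type_tally
```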
## Describe the data
At first glance, I can see that the birds dataset has 14 variables: 8 character and 6 numeric. The summary statistics suggested a single Element, since Element Code has the same minimum and maximum (5112), and listing the unique values confirmed that the only Element is "Stocks". Checking Domain and Flag Description for unique values showed me that there is one Domain, "Live Animals", and six unique flag descriptions describing how the data was obtained. Observations span 1961 to 2018.
```{r}
#| label: birds-data-wrangling
# List the distinct values of columns that appear to hold very few categories
uniq_elements <- distinct(birds, Element)
uniq_elements
uniq_domains <- distinct(birds, Domain)
uniq_domains
uniq_flags <- distinct(birds, `Flag Description`)
uniq_flags
```
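
Checking columns one at a time works, but the same question (which columns hold a single value?) can be asked of every column at once. This is a rough sketch on a made-up tibble named `toy`, which is a hypothetical stand-in for `birds`; it counts distinct values per column with `n_distinct()` and reports the columns where the count is 1.

```{r}
library(dplyr)
library(tibble)

# Hypothetical stand-in for `birds`; the rows are made up for illustration
toy <- tibble(
  Domain  = c("Live Animals", "Live Animals", "Live Animals"),
  Element = c("Stocks", "Stocks", "Stocks"),
  Item    = c("Chickens", "Ducks", "Chickens")
)

# Number of distinct values in every column
distinct_counts <- toy %>%
  summarise(across(everything(), n_distinct))

distinct_counts

# Columns with exactly one distinct value are constant
constant_cols <- names(distinct_counts)[unlist(distinct_counts) == 1]
constant_cols
```

Constant columns such as these carry no information for comparisons within the dataset, which is why they can be set aside early.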
Checking for unique Items and Areas showed me that there are five unique Items (Chickens, Ducks, Geese and guinea fowls, Turkeys, and Pigeons/other birds) and a total of 248 Areas.
```{r}
# One row per distinct (code, name) pair
uniq_areas <- distinct(birds, `Area Code`, Area)
uniq_areas
uniq_items <- distinct(birds, `Item Code`, Item)
uniq_items
```
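
The row count of the distinct code/name table only equals the number of Areas if each code maps to exactly one name and vice versa. The sketch below checks that on a made-up tibble, `toy`, a hypothetical stand-in for `birds`; the same check could be run on the Item columns.

```{r}
library(dplyr)
library(tibble)

# Hypothetical stand-in for `birds`; the rows are made up for illustration
toy <- tibble(
  `Area Code` = c(2, 2, 3, 4),
  Area        = c("Afghanistan", "Afghanistan", "Albania", "Algeria")
)

# Lookup table with one row per distinct (code, name) pair
area_lookup <- distinct(toy, `Area Code`, Area)
n_areas <- nrow(area_lookup)

# TRUE when neither a code nor a name repeats in the lookup table
one_to_one <- n_distinct(area_lookup$`Area Code`) == n_areas &&
  n_distinct(area_lookup$Area) == n_areas

n_areas
one_to_one
```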
Every observation uses the same Unit, "1000 Head", so each Value is a count in thousands of birds.
```{r}
# Unit was selected twice in the original call; one reference is enough
uniq_units <- distinct(birds, Unit)
uniq_units
```
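
Because the unit is thousands of head, a Value has to be multiplied by 1000 to get a raw head count. A small sketch of that conversion follows, using a made-up tibble `toy` and a new column name, `value_head`, both of which are my own assumptions rather than anything in the original data.

```{r}
library(dplyr)
library(tibble)

# Hypothetical stand-in for `birds`; the values are made up for illustration
toy <- tibble(
  Unit  = c("1000 Head", "1000 Head"),
  Value = c(2, 15)
)

# Convert thousands of head into individual head
toy_head <- toy %>%
  mutate(value_head = Value * 1000)

toy_head
```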
My initial thought is that the relevant variables in this dataset are Item, Area, Area Code, Value, and Year. Another noteworthy observation is that Value jumps from 15,404 at the 3rd quartile to more than 23 million at the maximum, and the mean (99,411) sits far above the median (1,800). This makes me think there could be outliers in play.
```{r}
# Sort the full table by Value (largest first), then keep only the Area column
birds %>%
  arrange(desc(Value)) %>%
  select(Area)
```
Arranging the Areas by descending Value shows that the largest values belong to World, an aggregate total rather than a single area. Mixing such totals with individual areas would severely distort any summary of the data, but that is a problem for another day.
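
One possible way to tackle that problem later is to drop aggregate rows before summarising. The sketch below assumes that aggregates such as World carry the flag description "Aggregate, may include official, semi-official, estimated or calculated data"; I have not verified that assumption against the real data. The tibble `toy` is a made-up stand-in for `birds`.

```{r}
library(dplyr)
library(tibble)

# Flag text taken from the list of flag descriptions; that it marks
# every aggregate row is an assumption, not something checked here
agg_flag <- "Aggregate, may include official, semi-official, estimated or calculated data"

# Hypothetical stand-in for `birds`; the values are made up for illustration
toy <- tibble(
  Area               = c("World", "Afghanistan", "Albania"),
  Value              = c(1000, 10, 5),
  `Flag Description` = c(agg_flag, "FAO estimate", "Official data")
)

# Keep non-aggregate rows, largest Value first.
# Note: `!=` also drops rows whose flag description is NA.
areas_only <- toy %>%
  filter(`Flag Description` != agg_flag) %>%
  arrange(desc(Value))

areas_only
```

If the flag turns out not to cover every aggregate, filtering on an explicit list of aggregate Area names would be an alternative.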