Code
library(tidyverse)
::opts_chunk$set(echo = TRUE) knitr
Audrey Bertin
May 30, 2023
For this challenge, I’ll be reading in the following datasets:
birds.csv
First, we start with birds.csv
:
A sample of this data can be seen below:
# A tibble: 6 × 14
`Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QA Live Anima… 2 Afgh… 5112 Stocks 1057
2 QA Live Anima… 2 Afgh… 5112 Stocks 1057
3 QA Live Anima… 2 Afgh… 5112 Stocks 1057
4 QA Live Anima… 2 Afgh… 5112 Stocks 1057
5 QA Live Anima… 2 Afgh… 5112 Stocks 1057
6 QA Live Anima… 2 Afgh… 5112 Stocks 1057
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
spc_tbl_ [30,977 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Domain Code : chr [1:30977] "QA" "QA" "QA" "QA" ...
$ Domain : chr [1:30977] "Live Animals" "Live Animals" "Live Animals" "Live Animals" ...
$ Area Code : num [1:30977] 2 2 2 2 2 2 2 2 2 2 ...
$ Area : chr [1:30977] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Element Code : num [1:30977] 5112 5112 5112 5112 5112 ...
$ Element : chr [1:30977] "Stocks" "Stocks" "Stocks" "Stocks" ...
$ Item Code : num [1:30977] 1057 1057 1057 1057 1057 ...
$ Item : chr [1:30977] "Chickens" "Chickens" "Chickens" "Chickens" ...
$ Year Code : num [1:30977] 1961 1962 1963 1964 1965 ...
$ Year : num [1:30977] 1961 1962 1963 1964 1965 ...
$ Unit : chr [1:30977] "1000 Head" "1000 Head" "1000 Head" "1000 Head" ...
$ Value : num [1:30977] 4700 4900 5000 5300 5500 5800 6600 6290 6300 6000 ...
$ Flag : chr [1:30977] "F" "F" "F" "F" ...
$ Flag Description: chr [1:30977] "FAO estimate" "FAO estimate" "FAO estimate" "FAO estimate" ...
- attr(*, "spec")=
.. cols(
.. `Domain Code` = col_character(),
.. Domain = col_character(),
.. `Area Code` = col_double(),
.. Area = col_character(),
.. `Element Code` = col_double(),
.. Element = col_character(),
.. `Item Code` = col_double(),
.. Item = col_character(),
.. `Year Code` = col_double(),
.. Year = col_double(),
.. Unit = col_character(),
.. Value = col_double(),
.. Flag = col_character(),
.. `Flag Description` = col_character()
.. )
- attr(*, "problems")=<externalptr>
As we can see, we appear to have information on the number of different types of birds in different areas around the world, based on different data sources and estimates. The data appears to be measuring birds in captivity, rather than wild birds.
We have 8 categorical variables and 6 numeric. The numeric variables appear to be discrete in that they can only have whole integer values.
The different types of birds we are tracking are as follows. We have the most data on chickens and the least on pigeons and other birds.
Chickens Ducks Geese and guinea fowls
13074 6909 4136
Pigeons, other birds Turkeys
1165 5693
The different type of measurements we have are:
Aggregate, may include official, semi-official, estimated or calculated data
6488
Data not available
1002
FAO data based on imputation methodology
1213
FAO estimate
10007
Official data
10773
Unofficial figure
1494
There appears to be a strange duplicate variable. Both Year
and Year Code
have identical values.
The birds data has 30,977 rows and 14 columns:
Looking at missing data, most of our variables have records for every row. There are only two that have missing values, Value
and Flag
. About 1/30 of the Values are missing and about 1/3 of the flags.
wild_bird_data.xlsx
We read this in using read_excel
:
A sample of this data can be seen below:
# A tibble: 6 × 2
`Wet body weight [g]` `Population size`
<dbl> <dbl>
1 5.46 532194.
2 7.76 3165107.
3 8.64 2592997.
4 10.7 3524193.
5 7.42 389806.
6 9.12 604766.
tibble [146 × 2] (S3: tbl_df/tbl/data.frame)
$ Wet body weight [g]: num [1:146] 5.46 7.76 8.64 10.69 7.42 ...
$ Population size : num [1:146] 532194 3165107 2592997 3524193 389806 ...
As we can see, we just have two variables here, the wet body weight (in grams) and the size of the population.
We can calculate some summary statistics about each of these variables:
Wet body weight [g] Population size
Min. : 5.459 Min. : 5
1st Qu.: 18.620 1st Qu.: 1821
Median : 69.232 Median : 24353
Mean : 363.694 Mean : 382874
3rd Qu.: 309.826 3rd Qu.: 198515
Max. :9639.845 Max. :5093378
Body weight ranges from ~5.5 grams to ~9640 grams. Population size ranges from 5 to 5,093,378, and we can see the means/medians/quartiles above as well.
There are 146 rows and none of the data appears to be missing.
It is difficult just from the dataset itself to understand the context of this data. Is the “Wet body weight [g]” the combined wet weight of the whole population? The average weight of birds in that population? It is unclear how the two columns are related.
Also, what type of birds are these? What does a row represent? We would need more information from the source to be able to understand this data further.
---
title: "Challenge 1 - Reading and understanding bird data"
author: "Audrey Bertin"
description: "Reading in csv and excel data"
date: "5/30/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_1
- birds
- wildbirds
- csv
- excel
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)
```
For this challenge, I'll be reading in the following datasets:
- birds.csv ⭐⭐
- wild_bird_data.xlsx ⭐⭐⭐
## `birds.csv`
First, we start with `birds.csv`:
```{r, message=FALSE}
birds <- readr::read_csv("_data/birds.csv")
```
A sample of this data can be seen below:
```{r}
head(birds)
```
```{r}
str(birds)
```
As we can see, we appear to have information on the number of different types of birds in different areas around the world, based on different data sources and estimates. The data appears to be measuring birds in captivity, rather than wild birds.
We have 8 categorical variables and 6 numeric. The numeric variables appear to be discrete in that they can only have whole integer values.
The different types of birds we are tracking are as follows. We have the most data on chickens and the least on pigeons and other birds.
```{r}
table(birds$Item)
```
The different type of measurements we have are:
```{r}
table(birds$`Flag Description`)
```
There appears to be a strange duplicate variable. Both `Year` and `Year Code` have identical values.
The birds data has 30,977 rows and 14 columns:
```{r}
dim(birds)
```
Looking at missing data, most of our variables have records for every row. There are only two that have missing values, `Value` and `Flag`. About 1/30 of the Values are missing and about 1/3 of the flags.
```{r}
colSums(is.na(birds))
```
## `wild_bird_data.xlsx`
We read this in using `read_excel`:
```{r message=FALSE}
# We need to add skip=1 to skip the first row and get rid of duplicate header labels
wild_birds <- readxl::read_excel("_data/wild_bird_data.xlsx", skip = 1)
```
A sample of this data can be seen below:
```{r}
head(wild_birds)
```
```{r}
str(wild_birds)
```
As we can see, we just have two variables here, the wet body weight (in grams) and the size of the population.
We can calculate some summary statistics about each of these variables:
```{r}
summary(wild_birds)
```
Body weight ranges from ~5.5 grams to ~9640 grams. Population size ranges from 5 to 5,093,378, and we can see the means/medians/quartiles above as well.
There are 146 rows and none of the data appears to be missing.
```{r}
dim(wild_birds)
```
```{r}
colSums(is.na(wild_birds))
```
It is difficult just from the dataset itself to understand the context of this data. Is the "Wet body weight [g]" the combined wet weight of the whole population? The average weight of birds in that population? It is unclear how the two columns are related.
Also, what type of birds are these? What does a row represent? We would need more information from the source to be able to understand this data further.