Challenge 1 - Reading and understanding bird data

challenge_1

birds

wildbirds

csv

excel

Reading in csv and excel data

Author

Audrey Bertin

Published

May 30, 2023

Code

library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)

For this challenge, I’ll be reading in the following datasets:

birds.csv ⭐⭐
wild_bird_data.xlsx ⭐⭐⭐

`birds.csv`

First, we start with birds.csv:

Code

birds <- readr::read_csv("_data/birds.csv")

A sample of this data can be seen below:

Code

head(birds)

# A tibble: 6 × 14
  `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
  <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
1 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
2 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
3 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
4 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
5 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
6 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>

Code

str(birds)

spc_tbl_ [30,977 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Domain Code     : chr [1:30977] "QA" "QA" "QA" "QA" ...
 $ Domain          : chr [1:30977] "Live Animals" "Live Animals" "Live Animals" "Live Animals" ...
 $ Area Code       : num [1:30977] 2 2 2 2 2 2 2 2 2 2 ...
 $ Area            : chr [1:30977] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Element Code    : num [1:30977] 5112 5112 5112 5112 5112 ...
 $ Element         : chr [1:30977] "Stocks" "Stocks" "Stocks" "Stocks" ...
 $ Item Code       : num [1:30977] 1057 1057 1057 1057 1057 ...
 $ Item            : chr [1:30977] "Chickens" "Chickens" "Chickens" "Chickens" ...
 $ Year Code       : num [1:30977] 1961 1962 1963 1964 1965 ...
 $ Year            : num [1:30977] 1961 1962 1963 1964 1965 ...
 $ Unit            : chr [1:30977] "1000 Head" "1000 Head" "1000 Head" "1000 Head" ...
 $ Value           : num [1:30977] 4700 4900 5000 5300 5500 5800 6600 6290 6300 6000 ...
 $ Flag            : chr [1:30977] "F" "F" "F" "F" ...
 $ Flag Description: chr [1:30977] "FAO estimate" "FAO estimate" "FAO estimate" "FAO estimate" ...
 - attr(*, "spec")=
  .. cols(
  ..   `Domain Code` = col_character(),
  ..   Domain = col_character(),
  ..   `Area Code` = col_double(),
  ..   Area = col_character(),
  ..   `Element Code` = col_double(),
  ..   Element = col_character(),
  ..   `Item Code` = col_double(),
  ..   Item = col_character(),
  ..   `Year Code` = col_double(),
  ..   Year = col_double(),
  ..   Unit = col_character(),
  ..   Value = col_double(),
  ..   Flag = col_character(),
  ..   `Flag Description` = col_character()
  .. )
 - attr(*, "problems")=<externalptr>

As we can see, we appear to have information on the number of different types of birds in different areas around the world, based on different data sources and estimates. The data appears to be measuring birds in captivity, rather than wild birds.

We have 8 categorical variables and 6 numeric. The numeric variables appear to be discrete in that they can only have whole integer values.

The different types of birds we are tracking are as follows. We have the most data on chickens and the least on pigeons and other birds.

Code

table(birds$Item)


              Chickens                  Ducks Geese and guinea fowls 
                 13074                   6909                   4136 
  Pigeons, other birds                Turkeys 
                  1165                   5693

The different type of measurements we have are:

Code

table(birds$`Flag Description`)


Aggregate, may include official, semi-official, estimated or calculated data 
                                                                        6488 
                                                          Data not available 
                                                                        1002 
                                    FAO data based on imputation methodology 
                                                                        1213 
                                                                FAO estimate 
                                                                       10007 
                                                               Official data 
                                                                       10773 
                                                           Unofficial figure 
                                                                        1494

There appears to be a strange duplicate variable. Both Year and Year Code have identical values.

The birds data has 30,977 rows and 14 columns:

Code

dim(birds)

[1] 30977    14

Looking at missing data, most of our variables have records for every row. There are only two that have missing values, Value and Flag. About 1/30 of the Values are missing and about 1/3 of the flags.

Code

colSums(is.na(birds))

     Domain Code           Domain        Area Code             Area 
               0                0                0                0 
    Element Code          Element        Item Code             Item 
               0                0                0                0 
       Year Code             Year             Unit            Value 
               0                0                0             1036 
            Flag Flag Description 
           10773                0

`wild_bird_data.xlsx`

We read this in using read_excel:

Code

# We need to add skip=1 to skip the first row and get rid of duplicate header labels
wild_birds <- readxl::read_excel("_data/wild_bird_data.xlsx", skip = 1)

A sample of this data can be seen below:

Code

head(wild_birds)

# A tibble: 6 × 2
  `Wet body weight [g]` `Population size`
                  <dbl>             <dbl>
1                  5.46           532194.
2                  7.76          3165107.
3                  8.64          2592997.
4                 10.7           3524193.
5                  7.42           389806.
6                  9.12           604766.

Code

str(wild_birds)

tibble [146 × 2] (S3: tbl_df/tbl/data.frame)
 $ Wet body weight [g]: num [1:146] 5.46 7.76 8.64 10.69 7.42 ...
 $ Population size    : num [1:146] 532194 3165107 2592997 3524193 389806 ...

As we can see, we just have two variables here, the wet body weight (in grams) and the size of the population.

We can calculate some summary statistics about each of these variables:

Code

summary(wild_birds)

 Wet body weight [g] Population size  
 Min.   :   5.459    Min.   :      5  
 1st Qu.:  18.620    1st Qu.:   1821  
 Median :  69.232    Median :  24353  
 Mean   : 363.694    Mean   : 382874  
 3rd Qu.: 309.826    3rd Qu.: 198515  
 Max.   :9639.845    Max.   :5093378

Body weight ranges from ~5.5 grams to ~9640 grams. Population size ranges from 5 to 5,093,378, and we can see the means/medians/quartiles above as well.

There are 146 rows and none of the data appears to be missing.

Code

dim(wild_birds)

[1] 146   2

Code

colSums(is.na(wild_birds))

Wet body weight [g]     Population size 
                  0                   0

It is difficult just from the dataset itself to understand the context of this data. Is the “Wet body weight [g]” the combined wet weight of the whole population? The average weight of birds in that population? It is unclear how the two columns are related.

Also, what type of birds are these? What does a row represent? We would need more information from the source to be able to understand this data further.