Challenge 1

Reading in data and creating a post

Author

Danny Holt

Published

June 1, 2023

`birds.csv`

For this challenge, I will read in the dataset birds.csv.

Code

birds <- readr::read_csv("_data/birds.csv")

Rows: 30977 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
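As the message above suggests, the column-specification output can be silenced by passing show_col_types = FALSE. Here is a minimal sketch of that option. It assumes readr 2.0 or later and uses a tiny inline CSV with made-up values in place of _data/birds.csv so that it runs on its own:

```r
library(readr)

# Toy CSV text standing in for _data/birds.csv (values are made up).
toy_csv <- I("Area,Year,Value\nA,2000,10\nB,2001,20\n")

# show_col_types = FALSE suppresses the column-specification message.
toy <- read_csv(toy_csv, show_col_types = FALSE)

dim(toy)

# For the real file, the equivalent call would be:
# birds <- readr::read_csv("_data/birds.csv", show_col_types = FALSE)
```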

Let’s look at the first several rows of the dataset:

Code

head(birds)

# A tibble: 6 × 14
  `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
  <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
1 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
2 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
3 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
4 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
5 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
6 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>

‘Year’ and ‘Year Code’ appear to be duplicate variables.
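One way to confirm that the two year columns really hold the same values is to compare them row by row. This is only a sketch on a toy data frame with made-up values, not a result from birds.csv:

```r
# Toy data frame mimicking two of the columns in birds (values are made up).
toy <- data.frame(
  "Year Code" = c(1961, 1962, 1963),
  "Year"      = c(1961, 1962, 1963),
  check.names = FALSE  # keep the space in "Year Code"
)

# TRUE only if every row has the same value in both columns.
years_match <- all(toy$`Year Code` == toy$Year)
years_match

# On the real data, the equivalent check would be:
# all(birds$`Year Code` == birds$Year)
```

If the check returns TRUE on the real data, one of the two columns could be dropped without losing information.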

As shown above, the data has 14 columns and 30,977 rows. Let’s look at the column names:

Code

colnames(birds)

 [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
 [5] "Element Code"     "Element"          "Item Code"        "Item"            
 [9] "Year Code"        "Year"             "Unit"             "Value"           
[13] "Flag"             "Flag Description"

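Now, I will use spec() to inspect the data type of each column in the dataset. Eight of the variables are stored as character and six as double (numeric). Note that several of the numeric columns, such as ‘Area Code’ and ‘Item Code’, appear to be identifier codes rather than measurements.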

Code

spec(birds)

cols(
  `Domain Code` = col_character(),
  Domain = col_character(),
  `Area Code` = col_double(),
  Area = col_character(),
  `Element Code` = col_double(),
  Element = col_character(),
  `Item Code` = col_double(),
  Item = col_character(),
  `Year Code` = col_double(),
  Year = col_double(),
  Unit = col_character(),
  Value = col_double(),
  Flag = col_character(),
  `Flag Description` = col_character()
)
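The tally of character and numeric columns can also be computed rather than counted by eye. Below is a small sketch on a toy data frame (made-up values) that counts columns by class:

```r
# Toy data frame with two character columns and one numeric column
# (values are made up; the real data has 8 character and 6 numeric).
toy <- data.frame(
  Area = c("A", "B"),
  Item = c("Chickens", "Ducks"),
  Year = c(1961, 1962),
  stringsAsFactors = FALSE
)

# class() of each column, then a count of how often each class occurs.
type_counts <- table(sapply(toy, class))
type_counts

# On the real data, the equivalent call would be:
# table(sapply(birds, class))
```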

Here is a table of the types of birds found in the dataset under the column ‘Item’. Chickens have the most rows (13,074). Note that this counts records in the dataset, not the number of birds.

Code

table(birds$Item)


              Chickens                  Ducks Geese and guinea fowls 
                 13074                   6909                   4136 
  Pigeons, other birds                Turkeys 
                  1165                   5693
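To make the comparison easier to read, the counts can be sorted and converted to shares of all rows. The sketch below re-enters the counts from the table above as a named vector so that it runs without the data file:

```r
# Row counts per bird type, copied from the table(birds$Item) output.
item_counts <- c(
  "Chickens"               = 13074,
  "Ducks"                  = 6909,
  "Geese and guinea fowls" = 4136,
  "Pigeons, other birds"   = 1165,
  "Turkeys"                = 5693
)

# Largest category first.
sorted_counts <- sort(item_counts, decreasing = TRUE)

# Share of all rows, in percent, rounded to one decimal place.
shares <- round(100 * sorted_counts / sum(sorted_counts), 1)
shares

# With the data loaded, the same idea would be:
# sort(table(birds$Item), decreasing = TRUE)
```

The counts sum to 30,977, which matches the number of rows reported when the file was read in.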

Now, we will use colSums(is.na()) to see where data is missing. We see that values are missing in only two columns: ‘Value’ (1,036 missing) and ‘Flag’ (10,773 missing).

Code

colSums(is.na(birds))

     Domain Code           Domain        Area Code             Area 
               0                0                0                0 
    Element Code          Element        Item Code             Item 
               0                0                0                0 
       Year Code             Year             Unit            Value 
               0                0                0             1036 
            Flag Flag Description 
           10773                0
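Raw counts of missing values are easier to interpret alongside the share of rows they represent. Here is a minimal sketch on a toy data frame (made-up values and placeholder flag codes) that computes both:

```r
# Toy data frame with some missing entries (values and codes are made up).
toy <- data.frame(
  Value = c(10, NA, 30, 40),
  Flag  = c(NA, "x", NA, "y"),
  stringsAsFactors = FALSE
)

# Number of missing values per column.
na_counts <- colSums(is.na(toy))

# Share of rows that are missing per column (mean of TRUE/FALSE).
na_share <- colMeans(is.na(toy))

na_counts
na_share

# On the real data, the equivalent call would be:
# colMeans(is.na(birds))
```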

The dataset records counts of live birds by type (column ‘Item’), area (columns ‘Area’ and ‘Area Code’), and year (‘Year’ and ‘Year Code’). Based on the information in the ‘Flag Description’ column, the data appears to be a mix of collected data and estimates.
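A natural follow-up is to tabulate ‘Flag Description’ and see how it lines up with ‘Flag’, including the rows where ‘Flag’ is missing. The sketch below uses a toy data frame; the codes and labels are hypothetical placeholders, not the actual values in birds.csv:

```r
# Toy data frame; flag codes and descriptions are hypothetical placeholders.
toy <- data.frame(
  "Flag"             = c(NA, "E", "E", NA),
  "Flag Description" = c("Collected", "Estimated", "Estimated", "Collected"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# How many rows fall under each description.
flag_counts <- table(toy$`Flag Description`)
flag_counts

# Cross-tabulate code against description, keeping NA codes visible.
flag_pairs <- table(toy$Flag, toy$`Flag Description`, useNA = "ifany")
flag_pairs

# On the real data, the equivalent calls would be:
# table(birds$`Flag Description`)
# table(birds$Flag, birds$`Flag Description`, useNA = "ifany")
```

On the real data, a cross-tabulation like this would show whether the missing ‘Flag’ values all correspond to a single description.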