Nick Boonstra Challenge 1 Resubmit

challenge_1
boonstra
week_1
birds
Author

Nick Boonstra

Published

August 16, 2022

Introduction

This challenge involves reading in and cleaning up data from the “birds” data set.

Setup chunk

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Reading In and Cleaning Up

Code
birds<-read_csv("_data/birds.csv")
birds
# A tibble: 30,977 × 14
   Domain Cod…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
   <chr>        <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
 1 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1961  1961
 2 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1962  1962
 3 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1963  1963
 4 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1964  1964
 5 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1965  1965
 6 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1966  1966
 7 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1967  1967
 8 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1968  1968
 9 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1969  1969
10 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1970  1970
# … with 30,967 more rows, 4 more variables: Unit <chr>, Value <dbl>,
#   Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
#   ¹​`Domain Code`, ²​`Area Code`, ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Fortunately, all of the data came in tidy, meaning:

  • Each column measured observations of only one variable;
  • Each row provided values for only one observation; and,
  • Each cell contained only one value.

Data Cleanup

However, while every column was nominally tidy-compliant, a number of them were extraneous, in that they provided redundant or useless data. These columns, specifically, were “Domain,” “Domain Code,” “Element,” “Element Code,” “Unit,” and “Year Code.”

“Year Code” was redundant, in that its values were equal to “Year” for every observation. Of course, attempting to verify this by hand would be borderline impossible, but luckily this was easily testable by some quick code:

Code
birds_test <- read_csv("_data/birds.csv")
birds_test <- birds_test %>%
  mutate(year_test = case_when(
    Year == `Year Code` ~ 1,
    TRUE ~ 0
  ))
count(birds_test,year_test)
# A tibble: 1 × 2
  year_test     n
      <dbl> <int>
1         1 30977
Code
rm(birds_test)

Because the value of value of the dummy “year_test” variable is equal to 1 for all observations, we can know that “Year Code” was equal to “Year” for all observations, and thus eliminate “Year Code” without losing any information.

In the case of the other five columns named above, all observations contained the same value, making the columns practically useless. Once again, this assertion was easily testable:

Code
count(birds,Domain)
# A tibble: 1 × 2
  Domain           n
  <chr>        <int>
1 Live Animals 30977
Code
count(birds,`Domain Code`)
# A tibble: 1 × 2
  `Domain Code`     n
  <chr>         <int>
1 QA            30977
Code
count(birds,Element)
# A tibble: 1 × 2
  Element     n
  <chr>   <int>
1 Stocks  30977
Code
count(birds,`Element Code`)
# A tibble: 1 × 2
  `Element Code`     n
           <dbl> <int>
1           5112 30977
Code
count(birds,Unit)
# A tibble: 1 × 2
  Unit          n
  <chr>     <int>
1 1000 Head 30977

Thus, with each column only carrying one value throughout the data set, these columns could be deleted without pratically losing information. In this case, however, the values being removed from the dataframe are not already being kept somewhere else in the dataframe, as opposed to the duplication found in the Year/Year Code case. Thus, recording these values in a separate location may be desirable, depending upon the exact nature of the dataset and the desired analysis.

Beyond removing these extraneous columns, the only other adjustments I found necessary were to rename the remaining columns to abide by “snake_case.” This was done for practicality (some of the column names had spaces), consistency, and personal preference.

All of these changes, then, are seen here:

Code
birds<-birds %>%
  select(-starts_with("Domain")) %>% ## All values identical for all obs
  rename(area_code = `Area Code`) %>%
  rename(area = Area) %>%
  select(-starts_with("Element")) %>% ## All values identical for all obs
  rename(item_code = `Item Code`) %>%
  rename(item = Item) %>%
  select(-`Year Code`) %>% ## Values identical to Year for all obs
  rename(year = Year) %>%
  select(-Unit) %>% ## All values identical for all obs
  rename(value = Value) %>%
  rename(flag = Flag) %>%
  rename(flag_desc = `Flag Description`)

birds
# A tibble: 30,977 × 8
   area_code area        item_code item      year value flag  flag_desc    
       <dbl> <chr>           <dbl> <chr>    <dbl> <dbl> <chr> <chr>        
 1         2 Afghanistan      1057 Chickens  1961  4700 F     FAO estimate 
 2         2 Afghanistan      1057 Chickens  1962  4900 F     FAO estimate 
 3         2 Afghanistan      1057 Chickens  1963  5000 F     FAO estimate 
 4         2 Afghanistan      1057 Chickens  1964  5300 F     FAO estimate 
 5         2 Afghanistan      1057 Chickens  1965  5500 F     FAO estimate 
 6         2 Afghanistan      1057 Chickens  1966  5800 F     FAO estimate 
 7         2 Afghanistan      1057 Chickens  1967  6600 F     FAO estimate 
 8         2 Afghanistan      1057 Chickens  1968  6290 <NA>  Official data
 9         2 Afghanistan      1057 Chickens  1969  6300 F     FAO estimate 
10         2 Afghanistan      1057 Chickens  1970  6000 F     FAO estimate 
# … with 30,967 more rows
# ℹ Use `print(n = ...)` to see more rows

Describing the data

These data appear to be recording the populations of various types of birds across a number of countries and years.

Code
names(birds)
[1] "area_code" "area"      "item_code" "item"      "year"      "value"    
[7] "flag"      "flag_desc"
Code
birds %>%
  group_by(item) %>%
  summarise("Median Values by Type" = median(value,na.rm=T))
# A tibble: 5 × 2
  item                   `Median Values by Type`
  <chr>                                    <dbl>
1 Chickens                                10784.
2 Ducks                                     510 
3 Geese and guinea fowls                    258 
4 Pigeons, other birds                     2800 
5 Turkeys                                   528 
Code
birds %>%
  group_by(year) %>%
  summarise("Median Values by Year" = median(value,na.rm=T))
# A tibble: 58 × 2
    year `Median Values by Year`
   <dbl>                   <dbl>
 1  1961                   1033 
 2  1962                   1014 
 3  1963                   1106 
 4  1964                   1103 
 5  1965                   1104 
 6  1966                   1088.
 7  1967                   1193 
 8  1968                   1252.
 9  1969                   1267 
10  1970                   1259 
# … with 48 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
birds %>%
  group_by(area) %>%
  summarise("Median Values by Area" = median(value,na.rm=T))
# A tibble: 248 × 2
   area                `Median Values by Area`
   <chr>                                 <dbl>
 1 Afghanistan                          6700  
 2 Africa                              12910. 
 3 Albania                              1300  
 4 Algeria                                42.5
 5 American Samoa                         38  
 6 Americas                            66924. 
 7 Angola                               6075  
 8 Antigua and Barbuda                    85  
 9 Argentina                            2355  
10 Armenia                              1528. 
# … with 238 more rows
# ℹ Use `print(n = ...)` to see more rows