challenge_1
Reading in data and creating a post
Author

Siddharth Goel

Published

January 20, 2023

Code
library(tidyverse)
library(readr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables)

Read in the Data

Code
birds_set <- read_csv('_data/birds.csv')
spec(birds_set)
cols(
  `Domain Code` = col_character(),
  Domain = col_character(),
  `Area Code` = col_double(),
  Area = col_character(),
  `Element Code` = col_double(),
  Element = col_character(),
  `Item Code` = col_double(),
  Item = col_character(),
  `Year Code` = col_double(),
  Year = col_double(),
  Unit = col_character(),
  Value = col_double(),
  Flag = col_character(),
  `Flag Description` = col_character()
)
Code
head(birds_set)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1961  1961 1000…
2 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1962  1962 1000…
3 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1963  1963 1000…
4 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1964  1964 1000…
5 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1965  1965 1000…
6 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1966  1966 1000…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`

This dataset has 14 columns and 30,977 rows. Every column is parsed as either col_character or col_double.
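
The row count, column count, and column types can be checked directly rather than read off the printed output. The following is a minimal sketch, assuming a small hypothetical tibble named toy as a stand-in for birds_set (the CSV is not bundled here, and the values are illustrative only):

```r
library(tibble)

# Hypothetical stand-in for birds_set; names and values are illustrative only.
toy <- tibble(
  `Area Code` = c(1, 1, 2),
  Area        = c("Area A", "Area A", "Area B"),
  Year        = c(1961, 1962, 1961),
  Value       = c(10, 12, 7)
)

# Rows and columns as an integer vector: c(rows, columns)
dims <- dim(toy)

# Storage type of each column, named by column
col_types <- vapply(toy, typeof, character(1))

print(dims)
print(col_types)
```

Running dim(birds_set) on the real data should report the row and column counts stated above.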

Describe the data

From the columns and the data, we can see that several columns carry the same information in two forms, a code and a descriptive label: Area and Area Code, Domain and Domain Code, Element and Element Code, Item and Item Code, and Year and Year Code. We can keep one column from each pair and store the code-to-label pairs in separate lookup tables to reduce the size of the data. Also, the columns Domain and Element each hold a single value, so these columns can be dropped as well.
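
One way the de-duplication described above could look is sketched here. It assumes a hypothetical tibble toy in place of birds_set; the names area_lookup, slim, and restored are illustrative, not part of the original analysis:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for birds_set; values are illustrative only.
toy <- tibble(
  Domain      = c("Live Animals", "Live Animals", "Live Animals"),
  `Area Code` = c(1, 1, 2),
  Area        = c("Area A", "Area A", "Area B"),
  Year        = c(1961, 1962, 1961),
  Value       = c(10, 12, 7)
)

# One row per code/label pair: the lookup (mapping) table
area_lookup <- toy %>% distinct(`Area Code`, Area)

# Drop the label column (recoverable via the lookup) and
# any column that holds only a single distinct value
slim <- toy %>%
  select(-Area) %>%
  select(where(~ n_distinct(.x) > 1))

# Labels can be restored with a join when needed
restored <- slim %>% left_join(area_lookup, by = "Area Code")

print(area_lookup)
print(slim)
```

Keeping the code column and joining the label back only when it is needed for display leaves the main table narrower without losing information.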

Code
unique(birds_set$Domain)
[1] "Live Animals"
Code
length(unique(birds_set$Area))
[1] 248
Code
unique(birds_set$Item)
[1] "Chickens"               "Ducks"                  "Geese and guinea fowls"
[4] "Turkeys"                "Pigeons, other birds"  
Code
unique(birds_set$Element)
[1] "Stocks"
Code
length(unique(birds_set$Year))
[1] 58
Code
unique(birds_set$Flag)
[1] "F"  NA   "Im" "M"  "*"  "A" 
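
The column-by-column checks above can also be gathered in a single pass. This is a minimal sketch on a hypothetical tibble toy (illustrative values, not the real data), using n_distinct() across every column and count() for the flag frequencies:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for birds_set; values are illustrative only.
toy <- tibble(
  Element = c("Stocks", "Stocks", "Stocks", "Stocks"),
  Item    = c("Chickens", "Ducks", "Chickens", "Turkeys"),
  Year    = c(1961, 1961, 1962, 1962),
  Flag    = c("F", NA, "Im", "F")
)

# Number of distinct values per column; NA counts as its own value by default
distinct_counts <- toy %>%
  summarise(across(everything(), n_distinct))

# Frequency of each flag, with NA kept as its own row
flag_counts <- toy %>% count(Flag)

print(distinct_counts)
print(flag_counts)
```

Columns whose distinct count is 1 are the constant columns, and the flag table shows how many rows carry no flag at all.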

By analyzing the data values and the unique column values, we can conclude that the dataset contains stock data for five categories of birds across 248 areas and 58 distinct years. The dataset covers a single domain, Live Animals, and a single element, Stocks; each row appears to record the value for one bird category in one area in one year.
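
A grouped summary is one way to back this description with numbers. The sketch below assumes a hypothetical tibble toy in place of birds_set; the summary column names (n_rows, n_areas, and so on) are illustrative choices:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for birds_set; values are illustrative only.
toy <- tibble(
  Item  = c("Chickens", "Chickens", "Ducks", "Ducks", "Ducks"),
  Area  = c("Area A", "Area B", "Area A", "Area A", "Area B"),
  Year  = c(1961, 1962, 1961, 1963, 1963),
  Value = c(10, 12, 3, NA, 5)
)

# Per bird category: row count, areas covered, year range, missing values
item_summary <- toy %>%
  group_by(Item) %>%
  summarise(
    n_rows     = n(),
    n_areas    = n_distinct(Area),
    first_year = min(Year),
    last_year  = max(Year),
    n_missing  = sum(is.na(Value)),
    .groups    = "drop"
  )

print(item_summary)
```

Applied to the real data, a table like this would show how coverage differs between the bird categories and where values are missing.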