DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 1

challenge_1
railroads
faostat
wildbirds
Author

Jack Sniezek

Published

November 28, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables)

Read in the Data

  • birds.csv ⭐⭐
Code
birds <- read_csv("_data/birds.csv")
birds
# A tibble: 30,977 × 14
   Domain Cod…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
   <chr>        <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
 1 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1961  1961
 2 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1962  1962
 3 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1963  1963
 4 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1964  1964
 5 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1965  1965
 6 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1966  1966
 7 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1967  1967
 8 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1968  1968
 9 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1969  1969
10 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1970  1970
# … with 30,967 more rows, 4 more variables: Unit <chr>, Value <dbl>,
#   Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
#   ¹​`Domain Code`, ²​`Area Code`, ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
Code
summary(birds)
 Domain Code           Domain            Area Code        Area          
 Length:30977       Length:30977       Min.   :   1   Length:30977      
 Class :character   Class :character   1st Qu.:  79   Class :character  
 Mode  :character   Mode  :character   Median : 156   Mode  :character  
                                       Mean   :1202                     
                                       3rd Qu.: 231                     
                                       Max.   :5504                     
                                                                        
  Element Code    Element            Item Code        Item          
 Min.   :5112   Length:30977       Min.   :1057   Length:30977      
 1st Qu.:5112   Class :character   1st Qu.:1057   Class :character  
 Median :5112   Mode  :character   Median :1068   Mode  :character  
 Mean   :5112                      Mean   :1066                     
 3rd Qu.:5112                      3rd Qu.:1072                     
 Max.   :5112                      Max.   :1083                     
                                                                    
   Year Code         Year          Unit               Value         
 Min.   :1961   Min.   :1961   Length:30977       Min.   :       0  
 1st Qu.:1976   1st Qu.:1976   Class :character   1st Qu.:     171  
 Median :1992   Median :1992   Mode  :character   Median :    1800  
 Mean   :1991   Mean   :1991                      Mean   :   99411  
 3rd Qu.:2005   3rd Qu.:2005                      3rd Qu.:   15404  
 Max.   :2018   Max.   :2018                      Max.   :23707134  
                                                  NA's   :1036      
     Flag           Flag Description  
 Length:30977       Length:30977      
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      

Describe the data

At first glance, the birds dataset has 30,977 rows and 14 variables: 8 character and 6 numeric. The summary statistics show that Element Code is 5112 in every row, which points to a single Element, “Stocks”. Checking Domain and Flag Description for unique values shows one Domain, “Live Animals”, and six flag descriptions that record how each value was obtained. Observations span 1961 to 2018.

Code
uniq_element <- select(birds,"Element")
num_elements <- unique(uniq_element)
num_elements
# A tibble: 1 × 1
  Element
  <chr>  
1 Stocks 
Code
uniq_domains <- select(birds,"Domain")
num_domains <- unique(uniq_domains)
num_domains
# A tibble: 1 × 1
  Domain      
  <chr>       
1 Live Animals
Code
uniq_flag <- select(birds,"Flag Description")
num_flags <- unique(uniq_flag)
num_flags
# A tibble: 6 × 1
  `Flag Description`                                                          
  <chr>                                                                       
1 FAO estimate                                                                
2 Official data                                                               
3 FAO data based on imputation methodology                                    
4 Data not available                                                          
5 Unofficial figure                                                           
6 Aggregate, may include official, semi-official, estimated or calculated data
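
The three checks above repeat the same select-then-unique pattern. As a minimal sketch, assuming a small made-up tibble `toy` that stands in for `birds` (its rows are illustrative, not taken from the file), `summarise()` with `across()` and `n_distinct()` can count the distinct values of every character column in one step:

```r
library(dplyr)

# Hypothetical stand-in for `birds`; the rows are made up for illustration.
toy <- tibble(
  Domain = c("Live Animals", "Live Animals", "Live Animals", "Live Animals"),
  Element = c("Stocks", "Stocks", "Stocks", "Stocks"),
  `Flag Description` = c("FAO estimate", "Official data",
                         "Official data", "Data not available"),
  Value = c(10, 20, 30, NA)
)

# Count the distinct values in every character column at once.
distinct_counts <- toy %>%
  summarise(across(where(is.character), n_distinct))

distinct_counts
```

A count of 1 flags a column that is constant across the whole table, which is how Element and Domain behave in the real data.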

Checking for unique Items and Areas shows five unique Items (Chickens; Ducks; Geese and guinea fowls; Turkeys; and Pigeons, other birds) and 248 unique Areas.

Code
uniq_area <- select(birds,"Area Code", Area)
num_areas <- unique(uniq_area)
num_areas
# A tibble: 248 × 2
   `Area Code` Area               
         <dbl> <chr>              
 1           2 Afghanistan        
 2           3 Albania            
 3           4 Algeria            
 4           5 American Samoa     
 5           7 Angola             
 6           8 Antigua and Barbuda
 7           9 Argentina          
 8           1 Armenia            
 9          22 Aruba              
10          10 Australia          
# … with 238 more rows
Code
uniq_item <- select(birds,"Item Code", Item)
num_items <- unique(uniq_item)
num_items
# A tibble: 5 × 2
  `Item Code` Item                  
        <dbl> <chr>                 
1        1057 Chickens              
2        1068 Ducks                 
3        1072 Geese and guinea fowls
4        1079 Turkeys               
5        1083 Pigeons, other birds  
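
Selecting each code together with its label assumes the two map one-to-one. A quick way to test that assumption could look like the sketch below; `toy` is a hypothetical stand-in for `birds`, and `ambiguous_codes` and `n_names` are names introduced here for illustration only:

```r
library(dplyr)

# Hypothetical stand-in for `birds`; only the two columns being checked.
toy <- tibble(
  `Area Code` = c(2, 2, 3, 3, 4),
  Area = c("Afghanistan", "Afghanistan", "Albania", "Albania", "Algeria")
)

# Codes that map to more than one name.
# Zero rows means every code has exactly one label.
ambiguous_codes <- toy %>%
  distinct(`Area Code`, Area) %>%
  count(`Area Code`, name = "n_names") %>%
  filter(n_names > 1)

nrow(ambiguous_codes)
```

The same pattern would apply to `Item Code` and `Item`.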

Every Value is recorded in a single Unit, 1000 Head, so a Value of 1 represents 1,000 birds.

Code
uniq_unit <- select(birds, Unit)
num_units <- unique(uniq_unit)
num_units
# A tibble: 1 × 1
  Unit     
  <chr>    
1 1000 Head
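
Because Value is stored in thousands, converting to individual birds is a single multiplication. A minimal sketch, assuming made-up rows in a hypothetical tibble `toy` and a new column name `head` that is not part of the original data:

```r
library(dplyr)

# Hypothetical rows; the Value figures are made up for illustration.
toy <- tibble(
  Item = c("Chickens", "Ducks"),
  Unit = c("1000 Head", "1000 Head"),
  Value = c(171, 1800)
)

# Convert from thousands of head to individual head.
toy_head <- toy %>%
  mutate(head = Value * 1000)

toy_head
```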

My initial impression is that the most relevant variables are Item, Area, Area Code, Value, and Year. The summary also shows that Value jumps from 15,404 at the 3rd quartile to 23,707,134 at the maximum (both in units of 1000 head), and the mean (99,411) sits far above the median (1,800). A distribution this skewed makes me think there could be outliers in play.

Code
# Sort the full table by Value (largest first), then keep only the Area column
birds %>%
  arrange(desc(Value)) %>%
  select(Area)
# A tibble: 30,977 × 1
   Area 
   <chr>
 1 World
 2 World
 3 World
 4 World
 5 World
 6 World
 7 World
 8 World
 9 World
10 World
# … with 30,967 more rows

Arranging the Areas by descending Value shows that the largest values belong to World rows. These are totals rather than individual countries, so they would double count birds and distort any summary statistics, but that is a problem for another day.
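
One possible way to set the World rows aside is sketched below. It assumes that total rows such as World carry the "Aggregate, may include official, semi-official, estimated or calculated data" flag description listed earlier. That is an assumption to verify against the real file, not something the output above confirms. The tibble `toy` and the names `aggregate_flag` and `countries_only` are hypothetical.

```r
library(dplyr)

# Flag description copied from the list of unique flags above.
aggregate_flag <- "Aggregate, may include official, semi-official, estimated or calculated data"

# Hypothetical stand-in for `birds`; values are made up for illustration.
# ASSUMPTION: aggregate rows such as World carry `aggregate_flag`.
toy <- tibble(
  Area = c("World", "Afghanistan", "Albania"),
  Value = c(5000, 20, 10),
  `Flag Description` = c(aggregate_flag, "FAO estimate", "Official data")
)

# Drop the aggregate rows, then rank the remaining areas by Value.
countries_only <- toy %>%
  filter(`Flag Description` != aggregate_flag) %>%
  arrange(desc(Value))

countries_only
```

If the assumption holds, re-running `summary()` on the filtered table would show whether the extreme maximum comes from the aggregates or from genuinely large country values.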
