DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 1 - Darron Bunt

challenge_1
birds
darron bunt
Author

Darron Bunt

Published

October 9, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables, etc)

Step 1 - Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

  • birds.csv ⭐⭐
Code
birds <- read_csv("_data/birds.csv")

Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.

Step 2 - Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Ok, so from what’s been read in above, we know that the birds dataset has 30,977 rows and 14 columns. Eight of those columns are character-based, while the remaining six are number-based. Neat.
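The shape and column types reported at import can also be checked directly. Here is a minimal sketch, assuming a small made-up tibble (`birds_demo`, a hypothetical stand-in with toy values) in place of the real `birds` object:

```r
library(tidyverse)

# Hypothetical stand-in for `birds`: toy values, four of the real column names.
birds_demo <- tibble(
  Area  = c("A", "B", "C"),
  Item  = c("Chickens", "Ducks", "Turkeys"),
  Year  = c(1961, 1962, 1963),
  Value = c(10, 20, 30)
)

# Number of rows, then number of columns
dim(birds_demo)

# First class of each column, then a tally of columns per class
col_classes <- map_chr(birds_demo, ~ class(.x)[1])
table(col_classes)
```

Running the same calls on `birds` should reproduce the 30,977 by 14 shape and the eight/six split between character and numeric columns.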

So now if I run birds, I should get a tibble, and in theory that tibble is going to help me perform a high-level description of the data.

Code
birds
# A tibble: 30,977 × 14
   Domain Cod…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
   <chr>        <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
 1 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1961  1961
 2 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1962  1962
 3 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1963  1963
 4 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1964  1964
 5 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1965  1965
 6 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1966  1966
 7 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1967  1967
 8 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1968  1968
 9 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1969  1969
10 QA           Live …       2 Afgh…    5112 Stocks     1057 Chic…    1970  1970
# … with 30,967 more rows, 4 more variables: Unit <chr>, Value <dbl>,
#   Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
#   ¹​`Domain Code`, ²​`Area Code`, ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`

The data appear to show worldwide historical stocks of five categories of live birds, measured in units of 1,000 head. Specifically, the dataset covers chickens, ducks, geese and guinea fowls, pigeons and other birds, and turkeys, across 248 areas of the world (some countries, some regions), from 1961 to 2018.

Code
count(birds, Item)
# A tibble: 5 × 2
  Item                       n
  <chr>                  <int>
1 Chickens               13074
2 Ducks                   6909
3 Geese and guinea fowls  4136
4 Pigeons, other birds    1165
5 Turkeys                 5693
Code
count(birds, Area)
# A tibble: 248 × 2
   Area                    n
   <chr>               <int>
 1 Afghanistan            58
 2 Africa                290
 3 Albania               232
 4 Algeria               232
 5 American Samoa         58
 6 Americas              232
 7 Angola                 58
 8 Antigua and Barbuda    58
 9 Argentina             232
10 Armenia                54
# … with 238 more rows
Code
count(birds,Year)
# A tibble: 58 × 2
    Year     n
   <dbl> <int>
 1  1961   493
 2  1962   493
 3  1963   493
 4  1964   493
 5  1965   494
 6  1966   495
 7  1967   495
 8  1968   495
 9  1969   498
10  1970   498
# … with 48 more rows
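
The three counts above can be condensed into a single summary row. This is a sketch on a hypothetical toy tibble (`birds_demo`, made-up values) rather than the real file:

```r
library(tidyverse)

# Hypothetical stand-in for `birds` with toy values.
birds_demo <- tibble(
  Area = c("A", "A", "B", "C"),
  Item = c("Chickens", "Ducks", "Chickens", "Turkeys"),
  Year = c(1961, 1961, 1962, 2018)
)

# Distinct items and areas, plus the first and last year covered
overview <- birds_demo %>%
  summarise(
    n_items    = n_distinct(Item),
    n_areas    = n_distinct(Area),
    first_year = min(Year),
    last_year  = max(Year)
  )
overview
```

On `birds`, `n_items` and `n_areas` should match the 5 and 248 rows reported in the tibble headers above.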

Judging by the flag descriptions, this data has come from a variety of sources, most commonly FAO (Food and Agriculture Organization) estimates and official data.

Code
count(birds,`Flag Description`)
# A tibble: 6 × 2
  `Flag Description`                                                           n
  <chr>                                                                    <int>
1 Aggregate, may include official, semi-official, estimated or calculated…  6488
2 Data not available                                                        1002
3 FAO data based on imputation methodology                                  1213
4 FAO estimate                                                             10007
5 Official data                                                            10773
6 Unofficial figure                                                         1494
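
To put numbers on "most commonly", the counts can be sorted and converted to shares. This is a sketch on a hypothetical toy tibble; the `Flag Description` column name comes from the dataset, but the rows are made up:

```r
library(tidyverse)

# Hypothetical stand-in for `birds`: only the flag column, toy rows.
birds_demo <- tibble(
  `Flag Description` = c("Official data", "Official data",
                         "FAO estimate", "Unofficial figure")
)

# Count each flag, most frequent first, and add its share of all rows
flag_shares <- birds_demo %>%
  count(`Flag Description`, sort = TRUE) %>%
  mutate(share = n / sum(n))
flag_shares
```

Applied to the counts printed above, official data (10,773 rows) and FAO estimates (10,007 rows) would each account for roughly a third of the 30,977 rows.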

Several columns carry redundant information: Domain Code and Domain each hold a single value across the whole dataset (QA and Live Animals, respectively), as do Element Code and Element (5112 and Stocks, respectively). The columns for Year Code and Year repeat the same data. The Unit is also the same for the entire dataset (1,000 head).

I used a series of count() calls to confirm the above; for reference, I have included the ones for Domain Code and Domain.

Code
count(birds,`Domain Code`)
# A tibble: 1 × 2
  `Domain Code`     n
  <chr>         <int>
1 QA            30977
Code
count(birds, Domain)
# A tibble: 1 × 2
  Domain           n
  <chr>        <int>
1 Live Animals 30977
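
Instead of running `count()` once per column, single-valued columns can be found in one pass, and the overlap between Year Code and Year can be tested directly. This is a sketch on hypothetical toy data (`birds_demo`, made-up rows using a few of the real column names):

```r
library(tidyverse)

# Hypothetical stand-in for `birds` with toy values.
birds_demo <- tibble(
  Domain      = c("Live Animals", "Live Animals", "Live Animals"),
  Item        = c("Chickens", "Ducks", "Turkeys"),
  `Year Code` = c(1961, 1962, 1963),
  Year        = c(1961, 1962, 1963)
)

# Names of columns that hold a single distinct value
constant_cols <- birds_demo %>%
  select(where(~ n_distinct(.x) == 1)) %>%
  names()
constant_cols

# TRUE when the two year columns agree on every row
years_match <- all(birds_demo$`Year Code` == birds_demo$Year)
years_match
```

If the description above holds, running this on `birds` should list Domain Code, Domain, Element Code, Element, and Unit as constant columns.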
Source Code
---
title: "Challenge 1 - Darron Bunt"
author: "Darron Bunt"
description: "Reading in data and creating a post"
date: "10/09/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_1
  - birds
  - darron bunt
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to

1)  read in a dataset, and

2)  describe the dataset using both words and any supporting information (e.g., tables, etc)

## Step 1 - Read in the Data

*Read in one (or more) of the following data sets, using the correct R package and command.*

-   birds.csv ⭐⭐

```{r}
birds <- read_csv("_data/birds.csv")
```

*Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.*

## Step 2 - Describe the data

*Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).*

Ok, so from what's been read in above, we know that the birds dataset has 30,977 rows and 14 columns. Eight of those columns are character-based, while the remaining six are number-based. Neat.

So now if I run birds, I should get a tibble, and in theory that tibble is going to help me perform a high-level description of the data. 

```{r}
birds
```
The data appear to show worldwide historical stocks of five categories of live birds, measured in units of 1,000 head. Specifically, the dataset covers chickens, ducks, geese and guinea fowls, pigeons and other birds, and turkeys, across 248 areas of the world (some countries, some regions), from 1961 to 2018.

```{r}
count(birds, Item)
count(birds, Area)
count(birds,Year)
```
Judging by the flag descriptions, this data has come from a variety of sources, most commonly FAO (Food and Agriculture Organization) estimates and official data. 

```{r}
count(birds,`Flag Description`)
```
Several columns carry redundant information: Domain Code and Domain each hold a single value across the whole dataset (QA and Live Animals, respectively), as do Element Code and Element (5112 and Stocks, respectively). The columns for Year Code and Year repeat the same data. The Unit is also the same for the entire dataset (1,000 head).

I used a series of count() calls to confirm the above; for reference, I have included the ones for Domain Code and Domain.

```{r}
count(birds,`Domain Code`)
count(birds, Domain)
```

```{r}
#| label: summary

```