DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 2

challenge_2
railroads
faostat
hotel_bookings
Author

Matthew O’Neill

Published

October 5, 2022

Code
library(tidyverse)  # attaches dplyr, readr, and tidyr, so a separate library(dplyr) is not needed

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
Code
data <- read_csv("../posts/_data/birds.csv")
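
As a minimal sketch of what this step produces, the snippet below writes a tiny CSV to a temporary file and reads it back with read_csv(). The file contents are hypothetical stand-ins, not rows from birds.csv; only the column names mirror the real data. It shows that column names containing spaces survive the import and need backticks when referenced later.

```r
library(readr)
library(dplyr)

# Hypothetical miniature file; the values are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Area,Item,Year Code,Value",
  "Afghanistan,Chickens,1961,4700",
  "Afghanistan,Chickens,1962,4900"
), csv_path)

# read_csv() guesses column types; show_col_types = FALSE silences the type message.
toy <- read_csv(csv_path, show_col_types = FALSE)

# Names with spaces are kept as-is, so they must be wrapped in backticks.
names(toy)
toy %>% select(Area, `Year Code`, Value)
```

The same backtick quoting is why the later chunks refer to `Year Code` rather than a bare name.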

Describe the data

Below are the first few rows of the dataset. It is very similar to the FAO dairy dataset, which I looked through in Challenge 1. At first glance, these data appear to be a lot simpler than the dairy data, as we may just be dealing with counts of chickens across farms in different countries over time. But there may be more bird types that we don’t see right away. The data were once again likely gathered by the Food and Agriculture Organization (FAO) of the United Nations.

Code
head(data)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1961  1961 1000…
2 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1962  1962 1000…
3 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1963  1963 1000…
4 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1964  1964 1000…
5 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1965  1965 1000…
6 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1966  1966 1000…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`

Below is a breakdown of item types in our dataset, and it is clear we are working with more than just Chickens. The data also include Ducks, Turkeys, a combined Geese and guinea fowls category, and a combined Pigeons and other birds category. It’s likely that different countries will have different proportions of each.

Code
item <- select(data, "Item")
table(item)
Item
              Chickens                  Ducks Geese and guinea fowls 
                 13074                   6909                   4136 
  Pigeons, other birds                Turkeys 
                  1165                   5693 
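
The same breakdown could also be produced with dplyr's count(), which returns a tibble that can be piped into later steps. This is only a sketch on a made-up stand-in for the Item column, not the author's approach; the object names toy and item_counts are hypothetical.

```r
library(dplyr)

# Toy stand-in for the Item column; these rows are invented for illustration.
toy <- tibble(
  Item = c("Chickens", "Ducks", "Chickens", "Turkeys", "Chickens", "Ducks")
)

# count() tallies rows per Item into a column named n;
# sort = TRUE orders the result from most to least frequent.
item_counts <- toy %>% count(Item, sort = TRUE)
item_counts
```

Unlike table(), the result is a regular tibble, so it can be filtered or joined like any other data frame.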

Grouped Summary Statistics

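Below are two tables. The first shows the mean Value for each kind of bird in each area. The second sorts those means in descending order. Its top rows are mostly aggregates (World, Asia, Americas, and so on) rather than individual countries; the largest single-country mean is Chickens in China, mainland, at about 2.4 million, followed by Chickens in the United States of America at about 1.4 million.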

Code
# Mean of Value for every Area/Item pair; missing values are counted as 0
bird_type <- data %>%
  group_by(`Area`, `Item`) %>%
  replace_na(list(`Value` = 0)) %>%
  summarise(mean(`Value`))

bird_type
# A tibble: 601 × 3
# Groups:   Area [248]
   Area        Item                   `mean(Value)`
   <chr>       <chr>                          <dbl>
 1 Afghanistan Chickens                       8099.
 2 Africa      Chickens                     936779.
 3 Africa      Ducks                         13639.
 4 Africa      Geese and guinea fowls        12164.
 5 Africa      Pigeons, other birds          11222.
 6 Africa      Turkeys                        9004.
 7 Albania     Chickens                       4055.
 8 Albania     Ducks                           173.
 9 Albania     Geese and guinea fowls          123 
10 Albania     Turkeys                         323.
# … with 591 more rows
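
One choice in the grouping code above deserves a second look: filling missing Value entries with 0 before averaging pulls a group's mean downward, because each missing year is then treated as a year with no birds. The sketch below compares that approach with dropping missing values via na.rm = TRUE. The tibble is a hypothetical toy, and the names zero_filled and na_dropped are illustrative only.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: area "A" has one missing observation.
toy <- tibble(
  Area  = c("A", "A", "A", "B", "B"),
  Item  = "Chickens",
  Value = c(100, NA, 200, 50, 70)
)

# Approach used in the post: treat missing values as zeros.
zero_filled <- toy %>%
  replace_na(list(Value = 0)) %>%
  group_by(Area, Item) %>%
  summarise(mean_value = mean(Value), .groups = "drop")

# Alternative: ignore missing values when averaging.
na_dropped <- toy %>%
  group_by(Area, Item) %>%
  summarise(mean_value = mean(Value, na.rm = TRUE), .groups = "drop")

zero_filled  # area A averages 100
na_dropped   # area A averages 150
```

Which option is appropriate depends on whether a missing entry means "no birds" or "not reported"; the data alone do not settle that.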
Code
bird_type %>%
  arrange(desc(`mean(Value)`))
# A tibble: 601 × 3
# Groups:   Area [248]
   Area                     Item     `mean(Value)`
   <chr>                    <chr>            <dbl>
 1 World                    Chickens     11624407.
 2 Asia                     Chickens      5498104.
 3 Americas                 Chickens      3163543.
 4 Eastern Asia             Chickens      2843125.
 5 China, mainland          Chickens      2417047.
 6 Europe                   Chickens      1945862.
 7 Northern America         Chickens      1518820.
 8 United States of America Chickens      1398810.
 9 South-eastern Asia       Chickens      1261959.
10 South America            Chickens      1150118.
# … with 591 more rows
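
Because the Area column mixes countries with regional aggregates, a country-level ranking needs the aggregates removed first. The sketch below does this for the ten rows shown above, with the means copied from that output. The aggregate_areas vector is a hand-written assumption covering only the regions visible in this table; the full dataset likely contains more groupings.

```r
library(dplyr)

# Means copied from the sorted table above, rounded to whole numbers.
top_means <- tibble(
  Area = c("World", "Asia", "Americas", "Eastern Asia", "China, mainland",
           "Europe", "Northern America", "United States of America",
           "South-eastern Asia", "South America"),
  Item = "Chickens",
  mean_value = c(11624407, 5498104, 3163543, 2843125, 2417047,
                 1945862, 1518820, 1398810, 1261959, 1150118)
)

# Assumed list of aggregate areas (only those visible in the table above).
aggregate_areas <- c("World", "Asia", "Americas", "Eastern Asia", "Europe",
                     "Northern America", "South-eastern Asia", "South America")

# Keep only rows whose Area is not a known aggregate, largest mean first.
country_means <- top_means %>%
  filter(!Area %in% aggregate_areas) %>%
  arrange(desc(mean_value))

country_means
```

On these ten rows, only China, mainland and the United States of America remain.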

Since the United States has one of the largest Chicken means of any single country in the table above, let’s take a closer look at its numbers.

Code
data %>%
  filter(`Area` == "United States of America") %>%
  select(`Area`, `Item`, `Year Code`, `Value`) %>%
  group_by(`Item`) %>%
  summarize(
    mean = mean(`Value`), median = median(`Value`), std = sd(`Value`),
    max = max(`Value`), min = min(`Value`)
  )
# A tibble: 3 × 6
  Item         mean  median     std     max    min
  <chr>       <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 Chickens 1398810. 1305000 463816. 2015000 751000
2 Ducks       5683.    6225   1485.    7900   3400
3 Turkeys   206418.  240219  69420.  302713  91837

Explain and Interpret

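It appears that US bird stocks are mostly Chickens, with a mean of about 1.40 million and a median of about 1.31 million over the years covered, in the units of the Value column. (The Unit column, truncated to 1000… in the first output above, suggests those units are thousands of birds rather than individual birds, and the Element column shows that the figures are stocks rather than production.) Turkeys are far less numerous over the same period, averaging about 206K per year, and Ducks are negligible by comparison, averaging under 6K. The US has no rows at all for Geese and guinea fowls or for Pigeons and other birds, so the dataset records no stocks of those types there.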

Comparing maximum and minimum values over time, it’s interesting that the largest Turkey figure (302,713) is still less than half of the smallest Chicken figure (751,000). This makes some sense given US culture, as Turkey is a much more seasonal meat than Chicken.

Finally, Turkey stocks varied considerably from year to year: the standard deviation is nearly 70K, about a third of the mean of 206K.
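
To put the spread of each bird type on a common scale, one option is the coefficient of variation (standard deviation divided by mean). The sketch below computes it from the rounded mean and std values in the US summary table above, so the results are approximate; the names us_summary, us_cv, and cv are illustrative.

```r
library(dplyr)

# Rounded values copied from the US summary table above.
us_summary <- tibble(
  Item = c("Chickens", "Ducks", "Turkeys"),
  mean = c(1398810, 5683, 206418),
  std  = c(463816, 1485, 69420)
)

# Coefficient of variation: spread relative to the size of the mean.
us_cv <- us_summary %>%
  mutate(cv = std / mean)

us_cv
```

By this measure Turkeys and Chickens vary by a similar proportion of their means (roughly a third each), while Ducks vary somewhat less.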
