DACSS 601: Data Science Fundamentals - FALL 2022
Challenge 2

challenge_2
birds
Theresa_Szczepanski
Author

Theresa Szczepanski

Published

September 19, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data Birds

Data Source

  • birds.csv ⭐⭐⭐
Code
  Birds <- read_csv("_data/birds.csv")
  head(Birds)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1961  1961 1000…
2 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1962  1962 1000…
3 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1963  1963 1000…
4 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1964  1964 1000…
5 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1965  1965 1000…
6 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1966  1966 1000…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
Code
  summary(Birds)
 Domain Code           Domain            Area Code        Area          
 Length:30977       Length:30977       Min.   :   1   Length:30977      
 Class :character   Class :character   1st Qu.:  79   Class :character  
 Mode  :character   Mode  :character   Median : 156   Mode  :character  
                                       Mean   :1202                     
                                       3rd Qu.: 231                     
                                       Max.   :5504                     
                                                                        
  Element Code    Element            Item Code        Item          
 Min.   :5112   Length:30977       Min.   :1057   Length:30977      
 1st Qu.:5112   Class :character   1st Qu.:1057   Class :character  
 Median :5112   Mode  :character   Median :1068   Mode  :character  
 Mean   :5112                      Mean   :1066                     
 3rd Qu.:5112                      3rd Qu.:1072                     
 Max.   :5112                      Max.   :1083                     
                                                                    
   Year Code         Year          Unit               Value         
 Min.   :1961   Min.   :1961   Length:30977       Min.   :       0  
 1st Qu.:1976   1st Qu.:1976   Class :character   1st Qu.:     171  
 Median :1992   Median :1992   Mode  :character   Median :    1800  
 Mean   :1991   Mean   :1991                      Mean   :   99411  
 3rd Qu.:2005   3rd Qu.:2005                      3rd Qu.:   15404  
 Max.   :2018   Max.   :2018                      Max.   :23707134  
                                                  NA's   :1036      
     Flag           Flag Description  
 Length:30977       Length:30977      
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      

Describe the data

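The birds data consists of 30,977 rows and 14 variables: 8 of character type and 6 of double type. The data seem to describe the population of stock (domesticated fowl) in countries and regions of the world for years between 1961 and 2018. Most descriptive variables are paired with a code variable: Area, Element, and Item each have a numeric code, Year has a numeric Year Code, and Domain has a character code (QA). Unit has no code, and Flag is itself a code for Flag Description.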
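The type counts above can be tallied programmatically rather than read off the `head()` output. The following is a minimal sketch on a hypothetical three-row stand-in (`toy_birds`, which copies only a few column names from birds.csv); the names `col_types` and `type_counts` are assumptions made for illustration.

```r
library(tidyverse)

# Hypothetical stand-in with a few of the columns found in birds.csv
toy_birds <- tibble(
  `Domain Code` = c("QA", "QA", "QA"),
  `Area Code`   = c(2, 2, 3),
  Area          = c("Afghanistan", "Afghanistan", "Albania"),
  Year          = c(1961, 1962, 1961),
  Value         = c(4700, 4900, NA)
)

# Storage type of every column, then a tally of columns per type
col_types <- map_chr(toy_birds, typeof)
type_counts <- table(col_types)
type_counts
```

Running the same two lines on the full `Birds` tibble should give the character/double split described above.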

Domain, Element: In this data set, all cases share the same Domain and Domain Code (QA, “Live Animals”) and the same Element and Element Code (5112, “Stocks”). The term stock seems to indicate that the animals are domesticated stock rather than wild fowl.

Code
  Domains <-select(Birds, "Domain Code", Domain)
  Num_Domains <-unique(Domains)
  Num_Domains
# A tibble: 1 × 2
  `Domain Code` Domain      
  <chr>         <chr>       
1 QA            Live Animals
Code
  Elements <-select(Birds, "Element Code", Element)
  Num_Elements <-unique(Elements)
  Num_Elements
# A tibble: 1 × 2
  `Element Code` Element
           <dbl> <chr>  
1           5112 Stocks 

Item: In this data set, every observation belongs to one of five item types: chickens, ducks, geese and guinea fowls, turkeys, or pigeons and other birds.

Code
  Items <-select(Birds, "Item Code", Item)
  Num_Items <-unique(Items)
  Num_Items
# A tibble: 5 × 2
  `Item Code` Item                  
        <dbl> <chr>                 
1        1057 Chickens              
2        1068 Ducks                 
3        1072 Geese and guinea fowls
4        1079 Turkeys               
5        1083 Pigeons, other birds  

Area consists of 248 distinct entries. The entries with codes below 5000 represent individual countries, and their numeric codes roughly follow the alphabetical order of the country names, with exceptions such as Armenia (1) and Aruba (22). The 28 entries with codes of 5000 or greater are aggregates rather than specific countries: World (5000) and regions such as Africa (5100) and Europe (5400). Among these, regions with codes closer in value seem to be geographically closer. Note that the aggregates are nested: there is an entry for Europe as well as for Eastern Europe and Western Europe, so the same birds are counted in more than one entry, and summing Value across all areas would double count them.
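One way to guard against that double counting is to label each area before summarising. This is a sketch, assuming the 5000 threshold observed above holds for every aggregate; the code and name pairs are copied from the listings in this post, while `toy_areas`, `area_type`, and `countries_only` are hypothetical names.

```r
library(tidyverse)

# A small sample of (Area Code, Area) pairs taken from the listings in this post
toy_areas <- tibble(
  `Area Code` = c(1, 2, 3, 22, 5000, 5400, 5401, 5404),
  Area = c("Armenia", "Afghanistan", "Albania", "Aruba",
           "World", "Europe", "Eastern Europe", "Western Europe")
)

# Label each entry as a country or an aggregate using the 5000 threshold
toy_areas <- toy_areas %>%
  mutate(area_type = if_else(`Area Code` >= 5000, "aggregate", "country"))

count(toy_areas, area_type)

# Keep only countries so that nested regions are not counted twice
countries_only <- filter(toy_areas, area_type == "country")
countries_only
```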

Code
  Areas <-select(Birds, "Area Code", Area)
  Num_Areas <-unique(Areas)
  Num_Areas
# A tibble: 248 × 2
   `Area Code` Area               
         <dbl> <chr>              
 1           2 Afghanistan        
 2           3 Albania            
 3           4 Algeria            
 4           5 American Samoa     
 5           7 Angola             
 6           8 Antigua and Barbuda
 7           9 Argentina          
 8           1 Armenia            
 9          22 Aruba              
10          10 Australia          
# … with 238 more rows
Code
  arrange(Num_Areas, `Area Code`)
# A tibble: 248 × 2
   `Area Code` Area               
         <dbl> <chr>              
 1           1 Armenia            
 2           2 Afghanistan        
 3           3 Albania            
 4           4 Algeria            
 5           5 American Samoa     
 6           7 Angola             
 7           8 Antigua and Barbuda
 8           9 Argentina          
 9          10 Australia          
10          11 Austria            
# … with 238 more rows
Code
  arrange(Num_Areas, desc(`Area Code`))
# A tibble: 248 × 2
   `Area Code` Area                     
         <dbl> <chr>                    
 1        5504 Polynesia                
 2        5503 Micronesia               
 3        5502 Melanesia                
 4        5501 Australia and New Zealand
 5        5500 Oceania                  
 6        5404 Western Europe           
 7        5403 Southern Europe          
 8        5402 Northern Europe          
 9        5401 Eastern Europe           
10        5400 Europe                   
# … with 238 more rows
Code
  World_Region <- filter(Num_Areas, `Area Code` >= 5000)
  arrange(World_Region, `Area Code`)
# A tibble: 28 × 2
   `Area Code` Area            
         <dbl> <chr>           
 1        5000 World           
 2        5100 Africa          
 3        5101 Eastern Africa  
 4        5102 Middle Africa   
 5        5103 Northern Africa 
 6        5104 Southern Africa 
 7        5105 Western Africa  
 8        5200 Americas        
 9        5203 Northern America
10        5204 Central America 
# … with 18 more rows

Unit, Value: Each observation records the year it was made (between 1961 and 2018) and the number of stock counted, stored in Value with a Unit of 1000 head. For example, a Value of 4700 represents 4,700,000 head of the given type of bird. Value is missing (NA) in 1,036 of the 30,977 rows.
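The unit conversion is a single multiplication. The sketch below uses the first three Value entries shown in this post plus one NA; `toy_values` and the `heads` column are hypothetical names introduced for illustration.

```r
library(tidyverse)

# First three Value entries from this post, plus one missing value
toy_values <- tibble(
  Unit  = "1000 Head",
  Value = c(4700, 4900, 5000, NA)
)

# heads holds the count in individual birds; NA stays NA
toy_values <- mutate(toy_values, heads = Value * 1000)
toy_values

# Number of rows with no recorded Value
n_missing <- sum(is.na(toy_values$Value))
n_missing
```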

Code
  Birds_Values <-select(Birds, Unit, Value)
  Units <-select(Birds, Unit)
  Num_Units <-unique(Units)
  Num_Units
# A tibble: 1 × 1
  Unit     
  <chr>    
1 1000 Head
Code
  summary(Birds_Values)
     Unit               Value         
 Length:30977       Min.   :       0  
 Class :character   1st Qu.:     171  
 Mode  :character   Median :    1800  
                    Mean   :   99411  
                    3rd Qu.:   15404  
                    Max.   :23707134  
                    NA's   :1036      
Code
  Birds_Values
# A tibble: 30,977 × 2
   Unit      Value
   <chr>     <dbl>
 1 1000 Head  4700
 2 1000 Head  4900
 3 1000 Head  5000
 4 1000 Head  5300
 5 1000 Head  5500
 6 1000 Head  5800
 7 1000 Head  6600
 8 1000 Head  6290
 9 1000 Head  6300
10 1000 Head  6000
# … with 30,967 more rows

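Flag takes 6 values describing the methodology by which the data were collected: five codes plus NA, which marks official data. Note that table() drops NA by default, so the counts below cover only the five codes (20,204 rows); the remaining 10,773 rows carry official data.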

Code
  Flag_Descriptions <-select(Birds, Flag, `Flag Description`)
  Num_Flag_Descriptions <-unique(Flag_Descriptions)
  Num_Flag_Descriptions
# A tibble: 6 × 2
  Flag  `Flag Description`                                                      
  <chr> <chr>                                                                   
1 F     FAO estimate                                                            
2 <NA>  Official data                                                           
3 Im    FAO data based on imputation methodology                                
4 M     Data not available                                                      
5 *     Unofficial figure                                                       
6 A     Aggregate, may include official, semi-official, estimated or calculated…
Code
  Flags <-select(Birds, Flag)
  table(Flags)
Flag
    *     A     F    Im     M 
 1494  6488 10007  1213  1002 
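The Flag tally above leaves out the rows where Flag is NA. A sketch of two ways to include them follows, using a hypothetical six-row `toy_flags` tibble rather than the real data.

```r
library(tidyverse)

# Hypothetical flag column; NA stands for "Official data" as in the lookup table
toy_flags <- tibble(Flag = c("F", NA, "Im", NA, "A", "F"))

# count() keeps NA as its own group
flag_counts <- count(toy_flags, Flag)
flag_counts

# Base R equivalent: ask table() to report NA when present
table(toy_flags$Flag, useNA = "ifany")
```

Applied to `Birds`, either form should show the official-data rows alongside the five flag codes.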

For the Birds data set, each case provides an estimate for the population of domesticated fowl for a given type of bird, in a given region of the world, for a given year.

Provide Grouped Summary Statistics

When the data are filtered to cases with Area = World (Area Code 5000), there is global aggregate data for each type of bird per year. The measures of central tendency by item show that chickens are the dominant domesticated fowl globally: their mean of about 11.6 million (in units of 1000 head, so about 11.6 billion birds) is roughly 18 times that of ducks, the next largest group. The measures of dispersion indicate that, in absolute terms, the domesticated chicken population has changed far more since 1961 than that of the other domesticated fowl.

Code
World_Data <- filter(Birds, `Area Code` == 5000)  # 5000 is the World aggregate
# summary(World_Data)
# World_Flags <- select(World_Data, `Flag Description`)
# Num_World_Flags <- unique(World_Flags)
# Num_World_Flags
World_Item <- World_Data %>% group_by(Item)
# World_Item
World_Item %>%
  summarise(
    mean = mean(Value, na.rm = TRUE),
    median = median(Value, na.rm = TRUE),
    sd = sd(Value, na.rm = TRUE),
    max = max(Value, na.rm = TRUE),
    min = min(Value, na.rm = TRUE),
    range = max - min,  # reuses the max and min columns defined just above
    var = var(Value, na.rm = TRUE)
  )
# A tibble: 5 × 8
  Item                        mean    median     sd    max    min  range     var
  <chr>                      <dbl>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
1 Chickens               11624407. 10436552. 6.13e6 2.37e7 3.91e6 1.98e7 3.76e13
2 Ducks                    645609.   544471  3.58e5 1.20e6 1.93e5 1.00e6 1.28e11
3 Geese and guinea fowls   177314.   124515  1.24e5 3.91e5 3.66e4 3.54e5 1.55e10
4 Pigeons, other birds      29409.    32222  1.15e4 5.79e4 1.21e4 4.58e4 1.32e 8
5 Turkeys                  352802.   421909  1.16e5 4.74e5 1.54e5 3.20e5 1.35e10
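Because the items differ so much in scale, the sd and range in the table above mostly reflect population size. A relative measure such as the coefficient of variation (sd divided by mean) or a growth ratio can complement them. The sketch below uses only the 1961 to 1970 world values for two items as shown in this post; `toy_world`, `relative_spread`, and its column names are hypothetical.

```r
library(tidyverse)

# World values (1000 head) for 1961-1970, copied from the wide table in this post
toy_world <- tibble(
  Item = rep(c("Chickens", "Pigeons, other birds"), each = 10),
  Year = rep(1961:1970, times = 2),
  Value = c(3906690, 4048728, 4163131, 4231221, 4349674,
            4445629, 4666511, 4823170, 4988438, 5209733,
            14055, 16026, 17018, 17963, 18869,
            19676, 19860, 12068, 12656, 13219)
)

relative_spread <- toy_world %>%
  arrange(Item, Year) %>%
  group_by(Item) %>%
  summarise(
    mean_value = mean(Value),
    sd_value   = sd(Value),
    cv         = sd_value / mean_value,        # spread relative to size
    growth     = last(Value) / first(Value)    # 1970 value over 1961 value
  )
relative_spread
```

This covers only one decade, so it says nothing about the full 1961 to 2018 period; the same pipeline applied to `World_Data` would.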
Code
# Keep the three columns needed, then spread the items into one column each (one row per year)
World_Data <- select(World_Data, Item, Year, Value)
World_Data_by_Item <- pivot_wider(World_Data, names_from = Item, values_from = Value)
World_Data_by_Item
# A tibble: 58 × 6
    Year Chickens  Ducks `Geese and guinea fowls` `Pigeons, other birds` Turkeys
   <dbl>    <dbl>  <dbl>                    <dbl>                  <dbl>   <dbl>
 1  1961  3906690 193452                    36640                  14055  204241
 2  1962  4048728 201167                    37737                  16026  174077
 3  1963  4163131 210275                    38489                  17018  161262
 4  1964  4231221 216183                    40928                  17963  153758
 5  1965  4349674 222799                    43523                  18869  154790
 6  1966  4445629 229837                    44491                  19676  166655
 7  1967  4666511 237028                    48028                  19860  174158
 8  1968  4823170 245201                    50302                  12068  155205
 9  1969  4988438 249936                    52459                  12656  157950
10  1970  5209733 256318                    54578                  13219  178971
# … with 48 more rows
Code
  #global summary statistics by item.

When considering the change in value of each item over time (this would best be visualized with line plots of item values on the y-axis and year on the x-axis):

  • Apart from a dip in the early 1960s (from 204,241 in 1961 to 153,758 in 1964, in units of 1000 head), the world turkey population seems to have increased from 1961 to 1990. From 1990 to 2018 the turkey population is consistently larger than in the previous 30 years but has not grown steadily year to year.
  • The world chicken population seems to have consistently increased year to year from 1961 to 2018.
  • The world duck population seems to have consistently increased year to year until 2004.
  • The world geese and guinea fowl population seems to have consistently increased year to year until 1993.
  • The world pigeon and other bird population shows much more variation in its year-to-year changes (for example, a drop from 19,860 in 1967 to 12,068 in 1968). This suggests that trends in global production, domestication, and consumption/use of chickens, ducks, turkeys, and geese over the last 60 years are much different from those of pigeons and other birds.
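The line plot suggested above, and a check of whether an item increased every year, can be sketched as follows. The values are the 1961 to 1965 world figures for ducks and turkeys shown in this post; `toy_world`, `yearly_change`, `always_up`, and `trend_plot` are hypothetical names, and the long format assumed here is the one `World_Data` has before `pivot_wider()`.

```r
library(tidyverse)

# World values (1000 head) for 1961-1965, copied from the table in this post
toy_world <- tibble(
  Item  = rep(c("Ducks", "Turkeys"), each = 5),
  Year  = rep(1961:1965, times = 2),
  Value = c(193452, 201167, 210275, 216183, 222799,
            204241, 174077, 161262, 153758, 154790)
)

# Year-over-year change within each item; the first year has no predecessor (NA)
yearly_change <- toy_world %>%
  group_by(Item) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(change = Value - lag(Value)) %>%
  ungroup()

# TRUE only if every recorded change for the item is positive
always_up <- yearly_change %>%
  group_by(Item) %>%
  summarise(always_increasing = all(change > 0, na.rm = TRUE))
always_up

# One line per item, year on the x-axis and value on the y-axis
trend_plot <- ggplot(toy_world, aes(x = Year, y = Value, colour = Item)) +
  geom_line() +
  labs(x = "Year", y = "Stock (1000 head)", colour = "Item")
trend_plot
```

With five items on very different scales, adding `facet_wrap(~ Item, scales = "free_y")` is one option for keeping the smaller series readable.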
Code
arrange(World_Data_by_Item, `Year`)
# A tibble: 58 × 6
    Year Chickens  Ducks `Geese and guinea fowls` `Pigeons, other birds` Turkeys
   <dbl>    <dbl>  <dbl>                    <dbl>                  <dbl>   <dbl>
 1  1961  3906690 193452                    36640                  14055  204241
 2  1962  4048728 201167                    37737                  16026  174077
 3  1963  4163131 210275                    38489                  17018  161262
 4  1964  4231221 216183                    40928                  17963  153758
 5  1965  4349674 222799                    43523                  18869  154790
 6  1966  4445629 229837                    44491                  19676  166655
 7  1967  4666511 237028                    48028                  19860  174158
 8  1968  4823170 245201                    50302                  12068  155205
 9  1969  4988438 249936                    52459                  12656  157950
10  1970  5209733 256318                    54578                  13219  178971
# … with 48 more rows
Code
# Alternative sort orders, left commented out:
#arrange(World_Data_by_Item, `Turkeys`)
#arrange(World_Data_by_Item, `Chickens`)
#arrange(World_Data_by_Item, `Ducks`)
#arrange(World_Data_by_Item, `Geese and guinea fowls`)
#arrange(World_Data_by_Item, `Pigeons, other birds`)
#perform global analysis by item of value over time

Explain and Interpret

Global domesticated stocks (and presumably production and consumption) of chickens, turkeys, ducks, and geese broadly increased from 1961 to 1990; pigeons and other birds do not show the same pattern. Perhaps technological innovation during this period allowed a large-scale increase in the capacity of farms to support this growth. Perhaps the increase was also driven by general population growth and the globalization of farming in this time period. Global stocks of chickens saw the largest absolute growth in this period. It would be worthwhile to explore the preference of items and the growth of Value by region of the world.

Further Challenge to attempt later

  • hotel_bookings.csv ⭐⭐⭐⭐
