DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 2

challenge_2
railroads
faostat
hotel_bookings
Author

Jack Sniezek

Published

November 29, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐
Code
# Read the birds data, dropping the "Code" columns and the constant columns
birds <- read_csv("_data/birds.csv") %>%
  select(-c(contains("Code"), Element, Domain, Unit))
birds
# A tibble: 30,977 × 6
   Area        Item      Year Value Flag  `Flag Description`
   <chr>       <chr>    <dbl> <dbl> <chr> <chr>             
 1 Afghanistan Chickens  1961  4700 F     FAO estimate      
 2 Afghanistan Chickens  1962  4900 F     FAO estimate      
 3 Afghanistan Chickens  1963  5000 F     FAO estimate      
 4 Afghanistan Chickens  1964  5300 F     FAO estimate      
 5 Afghanistan Chickens  1965  5500 F     FAO estimate      
 6 Afghanistan Chickens  1966  5800 F     FAO estimate      
 7 Afghanistan Chickens  1967  6600 F     FAO estimate      
 8 Afghanistan Chickens  1968  6290 <NA>  Official data     
 9 Afghanistan Chickens  1969  6300 F     FAO estimate      
10 Afghanistan Chickens  1970  6000 F     FAO estimate      
# … with 30,967 more rows
Code
summary(birds)
     Area               Item                Year          Value         
 Length:30977       Length:30977       Min.   :1961   Min.   :       0  
 Class :character   Class :character   1st Qu.:1976   1st Qu.:     171  
 Mode  :character   Mode  :character   Median :1992   Median :    1800  
                                       Mean   :1991   Mean   :   99411  
                                       3rd Qu.:2005   3rd Qu.:   15404  
                                       Max.   :2018   Max.   :23707134  
                                                      NA's   :1036      
     Flag           Flag Description  
 Length:30977       Length:30977      
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      

Describe the data

The birds dataset originally contained 14 variables: 8 character and 6 numeric. It was collected by the Food and Agriculture Organization of the United Nations (FAO). It features stock estimates for five types of bird (Chickens, Ducks, Geese and guinea fowls, Turkeys, and Pigeons, other birds) in 248 areas, covering 1961 to 2018.

When reading in the data, I dropped Element, Domain, and Unit because they take the same value in every row. I also dropped all of the “Code” variables, which are either redundant or not useful to work with.
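
One way to confirm that a column is constant before dropping it is to count its distinct values. The sketch below assumes a small made-up tibble, `toy`, in place of the raw `birds.csv`; its column names mirror the ones mentioned above, but every value is an invented placeholder.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the raw file; all values are placeholders.
toy <- tribble(
  ~Domain,    ~Area,    ~Element,    ~Item,      ~Unit,    ~Value,
  "domain_a", "Area A", "element_a", "Chickens", "unit_a", 4700,
  "domain_a", "Area A", "element_a", "Ducks",    "unit_a", 120,
  "domain_a", "Area B", "element_a", "Chickens", "unit_a", 9800
)

# Count distinct values per column; a count of 1 means the column is constant.
distinct_counts <- toy %>%
  summarise(across(everything(), n_distinct))

# Names of the columns that hold a single value in every row.
constant_cols <- names(distinct_counts)[unlist(distinct_counts) == 1]
constant_cols
```

Running the same `summarise(across(everything(), n_distinct))` call on the unfiltered data would show whether Element, Domain, and Unit each have a count of 1.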

Code
# One row per distinct Area
num_areas <- distinct(birds, Area)
num_areas
# A tibble: 248 × 1
   Area               
   <chr>              
 1 Afghanistan        
 2 Albania            
 3 Algeria            
 4 American Samoa     
 5 Angola             
 6 Antigua and Barbuda
 7 Argentina          
 8 Armenia            
 9 Aruba              
10 Australia          
# … with 238 more rows
Code
# One row per distinct Item (type of bird)
num_items <- distinct(birds, Item)
num_items
# A tibble: 5 × 1
  Item                  
  <chr>                 
1 Chickens              
2 Ducks                 
3 Geese and guinea fowls
4 Turkeys               
5 Pigeons, other birds  
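
When only the counts are needed, not the full lists, `n_distinct()` returns them directly. This is a minimal sketch on an invented tibble (`toy`), not the real data; the area names are placeholders.

```r
library(dplyr)
library(tibble)

# Hypothetical data: three placeholder areas and two bird types.
toy <- tribble(
  ~Area,    ~Item,
  "Area A", "Chickens",
  "Area A", "Ducks",
  "Area B", "Chickens",
  "Area C", "Chickens"
)

# Number of distinct areas and items in a single summary row.
counts <- toy %>%
  summarise(n_areas = n_distinct(Area),
            n_items = n_distinct(Item))
counts
```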

Provide Grouped Summary Statistics

I started my analysis of the birds dataset by taking a look at the average and median stock values by year.

Code
# Mean and median stock value per year, ignoring missing values
birds %>%
  group_by(Year) %>%
  summarise(avg_stocks = mean(Value, na.rm = TRUE),
            med_stocks = median(Value, na.rm = TRUE))
# A tibble: 58 × 3
    Year avg_stocks med_stocks
   <dbl>      <dbl>      <dbl>
 1  1961     36752.      1033 
 2  1962     37787.      1014 
 3  1963     38736.      1106 
 4  1964     39325.      1103 
 5  1965     40334.      1104 
 6  1966     41229.      1088.
 7  1967     43240.      1193 
 8  1968     44420.      1252.
 9  1969     45607.      1267 
10  1970     47706.      1259 
# … with 48 more rows
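
The mean in the yearly table is many times larger than the median, which is what a right-skewed distribution looks like: a few very large values pull the mean up while the median stays near the typical row. One possible cause, not verified here, is that the Area column mixes individual countries with much larger aggregates. The sketch below reproduces the effect on invented numbers.

```r
library(dplyr)
library(tibble)

# Hypothetical values: four small entries and one very large one.
toy <- tibble(
  Year  = 1961,
  Value = c(10, 20, 30, 40, 10000)
)

# The single large value dominates the mean but not the median.
skew_check <- toy %>%
  group_by(Year) %>%
  summarise(avg_stocks = mean(Value, na.rm = TRUE),
            med_stocks = median(Value, na.rm = TRUE))
skew_check
```

Here the mean is 2020 and the median is 30, so the two statistics describe quite different things about the same column.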

While the yearly summary shows the general trend over the 58 years, it is very coarse. My next step was to compute the average for each Item (type of bird) in each year. I dropped the median because I felt the average would be more informative.

Code
# Mean stock value per bird type and year, spread into one column per year
t1 <- birds %>%
  group_by(Item, Year) %>%
  summarise(avg_stocks = mean(Value, na.rm = TRUE)) %>%
  pivot_wider(names_from = Year, values_from = avg_stocks)
t1
# A tibble: 5 × 59
# Groups:   Item [5]
  Item     `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969` `1970`
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Chickens 74060. 76753. 78922. 80213. 82458. 83880. 88047. 91003. 94121. 98297.
2 Ducks     7232.  7520.  7861.  8082.  8329.  8592.  8861.  9166.  9257.  9493.
3 Geese a…  2364.  2435.  2483.  2641.  2808.  2870.  3099.  3245.  3331.  3465.
4 Pigeons…  3307.  3771.  4004.  4227.  4440.  4630.  4673.  2840.  2978.  3110.
5 Turkeys  10610.  9043   8377.  7987.  7938.  8546.  8931.  7959.  7998.  9062.
# … with 48 more variables: `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
#   `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
#   `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
#   `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
#   `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
#   `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
#   `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, …
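
With 58 year columns, most of the wide table is hidden when printed. A possible alternative, sketched here on invented data, is to average by decade before pivoting so the whole table fits on screen. The `Decade` column and the `toy` values are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical long-format data for two bird types.
toy <- tribble(
  ~Item,      ~Year, ~Value,
  "Chickens", 1961,  100,
  "Chickens", 1969,  200,
  "Chickens", 1970,  300,
  "Chickens", 1975,  500,
  "Ducks",    1961,  10,
  "Ducks",    1969,  30,
  "Ducks",    1970,  40,
  "Ducks",    1975,  NA
)

# Map each year to the start of its decade, average, then pivot wide.
by_decade <- toy %>%
  mutate(Decade = floor(Year / 10) * 10) %>%
  group_by(Item, Decade) %>%
  summarise(avg_stocks = mean(Value, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = Decade, values_from = avg_stocks)
by_decade
```

Averaging by decade trades year-level detail for readability, so short peaks within a decade can be smoothed away.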

Finally, I wanted to focus on a single Area, so I filtered the data to ‘Americas’. Its values are among the largest in the dataset, so the table prints in scientific notation, which is hard to read. However, the series is very complete, which makes it a good area to focus on.

Code
# Same summary as t1, restricted to the Americas
t2 <- birds %>%
  filter(Area == "Americas") %>%
  group_by(Item, Year) %>%
  summarise(avg_stocks = mean(Value, na.rm = TRUE)) %>%
  pivot_wider(names_from = Year, values_from = avg_stocks)
t2
# A tibble: 4 × 59
# Groups:   Item [4]
  Item     `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969` `1970`
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Chickens 1.19e6 1.22e6 1.24e6 1.29e6 1.33e6 1.37e6 1.43e6 1.47e6 1.53e6 1.56e6
2 Ducks    9.64e3 9.99e3 1.07e4 1.10e4 1.13e4 1.19e4 1.18e4 1.20e4 1.20e4 1.21e4
3 Geese a… 5.53e2 5.61e2 5.95e2 6.07e2 6.18e2 6.43e2 5.95e2 6.23e2 6.59e2 6.65e2
4 Turkeys  1.19e5 1.03e5 1.05e5 1.13e5 1.18e5 1.30e5 1.39e5 1.20e5 1.20e5 1.31e5
# … with 48 more variables: `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
#   `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
#   `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
#   `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
#   `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
#   `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
#   `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, …
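
The scientific notation in the table above can be avoided for display by formatting the values as text with thousands separators. This is a sketch on invented numbers using base R's `format()`; the `avg_label` column is a hypothetical name, and because it is a character column it is meant for presentation only, not for further arithmetic.

```r
library(dplyr)
library(tibble)

# Hypothetical averages of the same magnitude as the Americas table.
toy <- tribble(
  ~Item,      ~Year, ~avg_stocks,
  "Chickens", 1961,  1190000,
  "Turkeys",  1961,  119000
)

# Add a display column with fixed notation and comma separators.
labelled <- toy %>%
  mutate(avg_label = format(avg_stocks,
                            big.mark = ",",
                            scientific = FALSE,
                            trim = TRUE))
labelled
```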

Explain and Interpret

In my initial analysis of average stock values by year, the values increase over time. When I split the values by bird type, Chickens, Ducks, and Geese and guinea fowls increased steadily almost every year before plateauing in the 2010s. Pigeons peaked in the 1990s and have leveled off since. Turkeys have hovered around the same level since 1980.

When I narrowed the data to just the Americas, I noticed that there are no rows for Pigeons, other birds. Chickens grew steadily each year. Ducks and Turkeys plateaued around 1990. Geese and guinea fowls peaked in 1988-1989, dropped significantly, and then leveled off.
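
Statements about peaks are easier to check in long format than by scanning a wide table. The sketch below, on invented data, picks the year with the highest average for each bird type using `slice_max()`; the values are placeholders and do not reproduce the figures discussed above.

```r
library(dplyr)
library(tibble)

# Hypothetical yearly averages for two bird types.
toy <- tribble(
  ~Item,   ~Year, ~avg_stocks,
  "Geese", 1987,  500,
  "Geese", 1988,  900,
  "Geese", 1989,  400,
  "Ducks", 1987,  100,
  "Ducks", 1988,  120,
  "Ducks", 1989,  130
)

# Keep the single highest-valued row per bird type.
peaks <- toy %>%
  group_by(Item) %>%
  slice_max(avg_stocks, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Item, peak_year = Year, peak_value = avg_stocks)
peaks
```

Applied to the grouped summary before the `pivot_wider()` step, the same pattern would return one peak year per Item.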
