DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 2 Instructions

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in the Data
  • Describe the data
    • Explain and Interpret
  • Railroad dataset(railroad*.csv)-
  • Bird dataset(birds.csv)-
  • Provide Grouped Summary Statistics

Challenge 2 Instructions

  • Show All Code
  • Hide All Code

  • View Source
challenge_2
railroads
faostat
hotel_bookings
Author

Meredith Rolfe

Published

August 16, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

Railroad dataset(railroad*.csv)-

Code
data <- read_csv('_data/railroad_2012_clean_county.csv')
data
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# … with 2,920 more rows

it shows state, county and total number of employees for a given dataset.

It’s group by state and county.

Code
data %>% group_by(county,state) %>%
  summarize_at(c('total_employees'),mean)
# A tibble: 2,930 × 3
# Groups:   county [1,709]
   county    state total_employees
   <chr>     <chr>           <dbl>
 1 ABBEVILLE SC                124
 2 ACADIA    LA                 13
 3 ACCOMACK  VA                  4
 4 ADA       ID                 81
 5 ADAIR     IA                  5
 6 ADAIR     KY                  1
 7 ADAIR     MO                 21
 8 ADAIR     OK                  2
 9 ADAMS     CO                553
10 ADAMS     IA                  7
# … with 2,920 more rows

Now let’s see the mean of the data set group by county.

Code
data %>% group_by(county) %>%
  summarize_at(c('total_employees'),mean)
# A tibble: 1,709 × 2
   county    total_employees
   <chr>               <dbl>
 1 ABBEVILLE          124   
 2 ACADIA              13   
 3 ACCOMACK             4   
 4 ADA                 81   
 5 ADAIR                7.25
 6 ADAMS               73.2 
 7 ADDISON              8   
 8 AIKEN              193   
 9 AITKIN              19   
10 ALACHUA             22   
# … with 1,699 more rows

let’s see the mean of the data set group by state.

Code
data %>% group_by(state) %>%
  summarize_at(c('total_employees'),mean)
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 AE                2  
 2 AK               17.2
 3 AL               63.5
 4 AP                1  
 5 AR               53.8
 6 AZ              210. 
 7 CA              239. 
 8 CO               64.0
 9 CT              324  
10 DC              279  
# … with 43 more rows

The median of the data set group by state.

Code
data %>% group_by(state) %>%
  summarize_at(c('total_employees'),median)
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 AE                2  
 2 AK                2.5
 3 AL               26  
 4 AP                1  
 5 AR               16.5
 6 AZ               94  
 7 CA               61  
 8 CO               10  
 9 CT              125  
10 DC              279  
# … with 43 more rows

The median of the data set group by county.

Code
data %>% group_by(county) %>%
  summarize_at(c('total_employees'),median)
# A tibble: 1,709 × 2
   county    total_employees
   <chr>               <dbl>
 1 ABBEVILLE           124  
 2 ACADIA               13  
 3 ACCOMACK              4  
 4 ADA                  81  
 5 ADAIR                 3.5
 6 ADAMS                19.5
 7 ADDISON               8  
 8 AIKEN               193  
 9 AITKIN               19  
10 ALACHUA              22  
# … with 1,699 more rows

CT, AZ, and DC all three are compared to check the difference in mean of the total_employees present in the data set.

Code
result <- c("CT")
filter(data, state %in% result) %>% 
  summarize_at(c('total_employees'),mean)
# A tibble: 1 × 1
  total_employees
            <dbl>
1             324
Code
result <- c("AZ")
filter(data, state %in% result) %>% 
  summarize_at(c('total_employees'),mean)
# A tibble: 1 × 1
  total_employees
            <dbl>
1            210.
Code
result <- c("DC")
filter(data, state %in% result) %>% 
  summarize_at(c('total_employees'),mean)
# A tibble: 1 × 1
  total_employees
            <dbl>
1             279

The mean value vary a lot i.e. for CT it’s 324, AZ is 210.2 and DC is 279.

Now let’s see the Standard Deviation of the data set group by state and also by county.

Code
data %>% group_by(county) %>%
  summarize_at(c('total_employees'),sd)
# A tibble: 1,709 × 2
   county    total_employees
   <chr>               <dbl>
 1 ABBEVILLE           NA   
 2 ACADIA              NA   
 3 ACCOMACK            NA   
 4 ADA                 NA   
 5 ADAIR                9.32
 6 ADAMS              155.  
 7 ADDISON             NA   
 8 AIKEN               NA   
 9 AITKIN              NA   
10 ALACHUA             NA   
# … with 1,699 more rows
Code
data %>% group_by(state) %>%
  summarize_at(c('total_employees'),sd)
# A tibble: 53 × 2
   state total_employees
   <chr>           <dbl>
 1 AE               NA  
 2 AK               34.8
 3 AL              130. 
 4 AP               NA  
 5 AR              131. 
 6 AZ              228. 
 7 CA              549. 
 8 CO              128. 
 9 CT              520. 
10 DC               NA  
# … with 43 more rows

If we see CA then CT had the highest SD but if we look at the mean then it’s 324 and 238 which is a huge difference.

Bird dataset(birds.csv)-

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code
birds <- read_csv("_data/birds.csv")%>%
  select(-c(contains("Code"), Element, Domain, Unit))
birds
# A tibble: 30,977 × 6
   Area        Item      Year Value Flag  `Flag Description`
   <chr>       <chr>    <dbl> <dbl> <chr> <chr>             
 1 Afghanistan Chickens  1961  4700 F     FAO estimate      
 2 Afghanistan Chickens  1962  4900 F     FAO estimate      
 3 Afghanistan Chickens  1963  5000 F     FAO estimate      
 4 Afghanistan Chickens  1964  5300 F     FAO estimate      
 5 Afghanistan Chickens  1965  5500 F     FAO estimate      
 6 Afghanistan Chickens  1966  5800 F     FAO estimate      
 7 Afghanistan Chickens  1967  6600 F     FAO estimate      
 8 Afghanistan Chickens  1968  6290 <NA>  Official data     
 9 Afghanistan Chickens  1969  6300 F     FAO estimate      
10 Afghanistan Chickens  1970  6000 F     FAO estimate      
# … with 30,967 more rows

This data set has many variables like Area, Item, year, value, flag, and flag description.

Code
summary(birds)
     Area               Item                Year          Value         
 Length:30977       Length:30977       Min.   :1961   Min.   :       0  
 Class :character   Class :character   1st Qu.:1976   1st Qu.:     171  
 Mode  :character   Mode  :character   Median :1992   Median :    1800  
                                       Mean   :1991   Mean   :   99411  
                                       3rd Qu.:2005   3rd Qu.:   15404  
                                       Max.   :2018   Max.   :23707134  
                                                      NA's   :1036      
     Flag           Flag Description  
 Length:30977       Length:30977      
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      

Let’s take a look at average and median stock values by year.

Code
birds%>%
    group_by(Year)%>%
     summarise(mean(Value, na.rm=TRUE),median(Value, na.rm=TRUE))
# A tibble: 58 × 3
    Year `mean(Value, na.rm = TRUE)` `median(Value, na.rm = TRUE)`
   <dbl>                       <dbl>                         <dbl>
 1  1961                      36752.                         1033 
 2  1962                      37787.                         1014 
 3  1963                      38736.                         1106 
 4  1964                      39325.                         1103 
 5  1965                      40334.                         1104 
 6  1966                      41229.                         1088.
 7  1967                      43240.                         1193 
 8  1968                      44420.                         1252.
 9  1969                      45607.                         1267 
10  1970                      47706.                         1259 
# … with 48 more rows
Code
birds%>%
     filter(Area == "Americas")%>%
     group_by(Item,Year)%>%
     summarise(average = mean(Value, na.rm=TRUE))%>%
     pivot_wider(names_from = Year, values_from = (average))
# A tibble: 4 × 59
# Groups:   Item [4]
  Item     `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969` `1970`
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Chickens 1.19e6 1.22e6 1.24e6 1.29e6 1.33e6 1.37e6 1.43e6 1.47e6 1.53e6 1.56e6
2 Ducks    9.64e3 9.99e3 1.07e4 1.10e4 1.13e4 1.19e4 1.18e4 1.20e4 1.20e4 1.21e4
3 Geese a… 5.53e2 5.61e2 5.95e2 6.07e2 6.18e2 6.43e2 5.95e2 6.23e2 6.59e2 6.65e2
4 Turkeys  1.19e5 1.03e5 1.05e5 1.13e5 1.18e5 1.30e5 1.39e5 1.20e5 1.20e5 1.31e5
# … with 48 more variables: `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
#   `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
#   `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
#   `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
#   `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
#   `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
#   `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, …

From the analysis I can see that the average stock value has increased over the years and it has increased almost every year for every stock.

Source Code
---
title: "Challenge 2 Instructions"
author: "Meredith Rolfe"
desription: "Data wrangling: using group() and summarise()"
date: "08/16/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - faostat
  - hotel_bookings
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to

1)  read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2)  provide summary statistics for different interesting groups within the data, and interpret those statistics

## Read in the Data

Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.

-   railroad\*.csv or StateCounty2012.xls ⭐
-   FAOstat\*.csv or birds.csv ⭐⭐⭐
-   hotel_bookings.csv ⭐⭐⭐⭐

## Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

### Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

## Railroad dataset(railroad\*.csv)-

```{r}
data <- read_csv('_data/railroad_2012_clean_county.csv')
data
```

it shows state, county and total number of employees for a given dataset.

It's group by state and county.

```{r}
data %>% group_by(county,state) %>%
  summarize_at(c('total_employees'),mean)
```

Now let's see the mean of the data set group by county.

```{r}
data %>% group_by(county) %>%
  summarize_at(c('total_employees'),mean)
```

let's see the mean of the data set group by state.

```{r}
data %>% group_by(state) %>%
  summarize_at(c('total_employees'),mean)
```

The median of the data set group by state.

```{r}
data %>% group_by(state) %>%
  summarize_at(c('total_employees'),median)
```

The median of the data set group by county.

```{r}
data %>% group_by(county) %>%
  summarize_at(c('total_employees'),median)
```

CT, AZ, and DC all three are compared to check the difference in mean of the total_employees present in the data set.

```{r}
result <- c("CT")
filter(data, state %in% result) %>% 
  summarize_at(c('total_employees'),mean)
```

```{r}
result <- c("AZ")
filter(data, state %in% result) %>% 
  summarize_at(c('total_employees'),mean)
```

```{r}
result <- c("DC")
filter(data, state %in% result) %>% 
  summarize_at(c('total_employees'),mean)
```

The mean value vary a lot i.e. for CT it's 324, AZ is 210.2 and DC is 279.

Now let's see the Standard Deviation of the data set group by state and also by county.

```{r}
data %>% group_by(county) %>%
  summarize_at(c('total_employees'),sd)

```

```{r}
data %>% group_by(state) %>%
  summarize_at(c('total_employees'),sd)
```

If we see CA then CT had the highest SD but if we look at the mean then it's 324 and 238 which is a huge difference.

## Bird dataset(birds.csv)-

## Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

```{r}
birds <- read_csv("_data/birds.csv")%>%
  select(-c(contains("Code"), Element, Domain, Unit))
birds
```

This data set has many variables like Area, Item, year, value, flag, and flag description.

```{r}
summary(birds)
```

Let's take a look at average and median stock values by year.

```{r}
birds%>%
    group_by(Year)%>%
     summarise(mean(Value, na.rm=TRUE),median(Value, na.rm=TRUE))
```

```{r}
birds%>%
     filter(Area == "Americas")%>%
     group_by(Item,Year)%>%
     summarise(average = mean(Value, na.rm=TRUE))%>%
     pivot_wider(names_from = Year, values_from = (average))
```

From the analysis I can see that the average stock value has increased over the years and it has increased almost every year for every stock.