DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 2

challenge_2
railroads
faostat
hotel_bookings
Author

Megha Joseph

Published

October 19, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set and describe it using both words and supporting information (e.g., tables)
  2. provide summary statistics for interesting groups within the data and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

  • railroad*.csv or StateCounty2012.xls ⭐
  • FAOstat*.csv or birds.csv ⭐⭐⭐
  • hotel_bookings.csv ⭐⭐⭐⭐

I read in the FAOstat_cattle_dairy.csv file using read_csv().

Code
df <- read_csv("_data/FAOstat_cattle_dairy.csv")
df
# A tibble: 36,449 × 14
   Domain Cod…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
   <chr>        <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
 1 QL           Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961
 2 QL           Lives…       2 Afgh…    5420 Yield       882 Milk…    1961  1961
 3 QL           Lives…       2 Afgh…    5510 Produc…     882 Milk…    1961  1961
 4 QL           Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1962  1962
 5 QL           Lives…       2 Afgh…    5420 Yield       882 Milk…    1962  1962
 6 QL           Lives…       2 Afgh…    5510 Produc…     882 Milk…    1962  1962
 7 QL           Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1963  1963
 8 QL           Lives…       2 Afgh…    5420 Yield       882 Milk…    1963  1963
 9 QL           Lives…       2 Afgh…    5510 Produc…     882 Milk…    1963  1963
10 QL           Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1964  1964
# … with 36,439 more rows, 4 more variables: Unit <chr>, Value <dbl>,
#   Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
#   ¹​`Domain Code`, ²​`Area Code`, ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
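
As a minimal sketch of the read-in step, the example below writes a tiny CSV to a temporary file and reads it back with explicit column types. The file contents, the area names, and the flag and unit values are made up for illustration; only the column names are taken from the tibble printed above. Declaring `col_types` is a precaution I am assuming is useful here, since a column holding only values like `F` could otherwise be guessed as logical.

```r
library(readr)
library(dplyr)

# Hypothetical three-row sample; values are invented, only the
# column names mirror the FAOSTAT file shown above
sample_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Area Code,Area,Element Code,Year,Unit,Value,Flag",
  "1,Area A,5318,1961,unit_a,100,F",
  "1,Area A,5420,1961,unit_b,200,F",
  "1,Area A,5510,1961,unit_c,300,F"
), sample_path)

# Explicit column types avoid surprises from type guessing
sample_df <- read_csv(
  sample_path,
  col_types = cols(
    `Area Code` = col_double(),
    Area = col_character(),
    `Element Code` = col_double(),
    Year = col_double(),
    Unit = col_character(),
    Value = col_double(),
    Flag = col_character()
  )
)

# Compact overview of column names and types
glimpse(sample_df)

# Column names containing spaces need backticks
sample_df %>% select(`Area Code`, Area)
```

The same `col_types` idea could be applied to the real file, listing only the columns whose guessed type looks wrong.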

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, provide a high-level description of the data. Describe as efficiently as possible where and how the data was (likely) gathered, and indicate the cases and variables (both their interpretation and any details the reader needs to fully understand your chosen data).

Code
summary(df)
 Domain Code           Domain            Area Code          Area          
 Length:36449       Length:36449       Min.   :   1.0   Length:36449      
 Class :character   Class :character   1st Qu.:  69.0   Class :character  
 Mode  :character   Mode  :character   Median : 141.0   Mode  :character  
                                       Mean   : 775.2                     
                                       3rd Qu.: 215.0                     
                                       Max.   :5504.0                     
                                                                          
  Element Code    Element            Item Code       Item          
 Min.   :5318   Length:36449       Min.   :882   Length:36449      
 1st Qu.:5318   Class :character   1st Qu.:882   Class :character  
 Median :5420   Mode  :character   Median :882   Mode  :character  
 Mean   :5416                      Mean   :882                     
 3rd Qu.:5510                      3rd Qu.:882                     
 Max.   :5510                      Max.   :882                     
                                                                   
   Year Code         Year          Unit               Value          
 Min.   :1961   Min.   :1961   Length:36449       Min.   :        7  
 1st Qu.:1976   1st Qu.:1976   Class :character   1st Qu.:     7849  
 Median :1991   Median :1991   Mode  :character   Median :    43266  
 Mean   :1990   Mean   :1990                      Mean   :  4410235  
 3rd Qu.:2005   3rd Qu.:2005                      3rd Qu.:   700000  
 Max.   :2018   Max.   :2018                      Max.   :683217055  
                                                  NA's   :74         
     Flag           Flag Description  
 Length:36449       Length:36449      
 Class :character   Class :character  
 Mode  :character   Mode  :character  
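
The summary above reports 74 missing entries in Value, a Year range of 1961 to 2018, and a single Item Code (882). To describe the cases more directly, a few counting commands may help. The sketch below uses a small made-up tibble (`toy`, with invented values) that borrows the real column names; on the real data the same calls would be applied to `df`.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for df; values are invented for illustration
toy <- tibble(
  `Area Code` = c(1, 1, 1, 2, 2, 2),
  `Element Code` = c(5318, 5420, 5510, 5318, 5420, 5510),
  Year = c(1961, 1961, 1961, 1962, 1962, 1962),
  Value = c(10, 20, NA, 40, 50, 60)
)

# How many rows does each element contribute?
toy %>% count(`Element Code`)

# How many distinct areas, which years, and how many missing values?
overview <- toy %>%
  summarise(
    n_areas = n_distinct(`Area Code`),
    first_year = min(Year),
    last_year = max(Year),
    n_missing = sum(is.na(Value))
  )
overview
```

Counts like these suggest what a single case is, for example one area, one element, and one year per row, which `summary()` alone does not show.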
                                      
                                      
                                      
                                      

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.

Code
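# summarise() needs a data frame as input; grouping by Element and Unit is one possible choice
df %>%
  group_by(Element, Unit) %>%
  summarise(mean_value = mean(Value, na.rm = TRUE), median_value = median(Value, na.rm = TRUE), sd_value = sd(Value, na.rm = TRUE), .groups = "drop")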
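
A grouped summary covering the statistics listed in the prompt could look like the sketch below. It runs on a small made-up tibble (`toy`, invented values) rather than the real file, and `stat_mode()` is a hypothetical helper written for this example, since base R has no function for the statistical mode. Grouping by `Element Code` is an assumption on my part: the three codes appear to describe different measures, likely in different units, so pooling them would mix incomparable values.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: most frequent non-missing value
# (base R's mode() reports the storage type, not the statistical mode)
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical stand-in for df; values are invented for illustration
toy <- tibble(
  `Element Code` = c(5318, 5318, 5318, 5420, 5420, 5420),
  Value = c(10, 10, 40, 5, 7, NA)
)

grouped <- toy %>%
  group_by(`Element Code`) %>%
  summarise(
    n = n(),
    mean_value = mean(Value, na.rm = TRUE),
    median_value = median(Value, na.rm = TRUE),
    mode_value = stat_mode(Value),
    sd_value = sd(Value, na.rm = TRUE),
    min_value = min(Value, na.rm = TRUE),
    max_value = max(Value, na.rm = TRUE),
    q25 = quantile(Value, 0.25, na.rm = TRUE, names = FALSE),
    q75 = quantile(Value, 0.75, na.rm = TRUE, names = FALSE),
    .groups = "drop"
  )
grouped
```

Passing `na.rm = TRUE` matters because a single missing Value would otherwise turn the statistics for that group into NA.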

Explain and Interpret

Be sure to explain why you chose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, as long as it is included.
