DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 2

challenge_2
railroads
faostat
hotel_bookings
Author

Mariia Dubyk

Published

October 5, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

Code
birds<-read_csv("_data/birds.csv")
Code
# read_csv() drops the byte-order mark and keeps the original column names
# (with spaces), so the first column is already `Domain Code` and no rename
# is needed. Names that contain spaces are referenced with backticks.
# The reduced table is stored under a separate name so that the summaries
# below still cover all 14 columns.
birds_core <- birds %>%
  select(Domain, Area, Element, Item, Year, Unit, Value, Flag, `Flag Description`)
Code
summary(birds)
 Domain Code           Domain            Area Code        Area          
 Length:30977       Length:30977       Min.   :   1   Length:30977      
 Class :character   Class :character   1st Qu.:  79   Class :character  
 Mode  :character   Mode  :character   Median : 156   Mode  :character  
                                       Mean   :1202                     
                                       3rd Qu.: 231                     
                                       Max.   :5504                     
                                                                        
  Element Code    Element            Item Code        Item          
 Min.   :5112   Length:30977       Min.   :1057   Length:30977      
 1st Qu.:5112   Class :character   1st Qu.:1057   Class :character  
 Median :5112   Mode  :character   Median :1068   Mode  :character  
 Mean   :5112                      Mean   :1066                     
 3rd Qu.:5112                      3rd Qu.:1072                     
 Max.   :5112                      Max.   :1083                     
                                                                    
   Year Code         Year          Unit               Value         
 Min.   :1961   Min.   :1961   Length:30977       Min.   :       0  
 1st Qu.:1976   1st Qu.:1976   Class :character   1st Qu.:     171  
 Median :1992   Median :1992   Mode  :character   Median :    1800  
 Mean   :1991   Mean   :1991                      Mean   :   99411  
 3rd Qu.:2005   3rd Qu.:2005                      3rd Qu.:   15404  
 Max.   :2018   Max.   :2018                      Max.   :23707134  
                                                  NA's   :1036      
     Flag           Flag Description  
 Length:30977       Length:30977      
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      
Code
library(summarytools)
print(dfSummary(birds))
Data Frame Summary  
birds  
Dimensions: 30977 x 14  
Duplicates: 0  

----------------------------------------------------------------------------------------------------------------------------
No   Variable           Stats / Values                   Freqs (% of Valid)      Graph                  Valid      Missing  
---- ------------------ -------------------------------- ----------------------- ---------------------- ---------- ---------
1    Domain Code        1. QA                            30977 (100.0%)          IIIIIIIIIIIIIIIIIIII   30977      0        
     [character]                                                                                        (100.0%)   (0.0%)   

2    Domain             1. Live Animals                  30977 (100.0%)          IIIIIIIIIIIIIIIIIIII   30977      0        
     [character]                                                                                        (100.0%)   (0.0%)   

3    Area Code          Mean (sd) : 1201.7 (2099.4)      248 distinct values     :                      30977      0        
     [numeric]          min < med < max:                                         :                      (100.0%)   (0.0%)   
                        1 < 156 < 5504                                           :                                          
                        IQR (CV) : 152 (1.7)                                     :                 .                        
                                                                                 :                 :                        

4    Area               1. Africa                          290 ( 0.9%)                                  30977      0        
     [character]        2. Asia                            290 ( 0.9%)                                  (100.0%)   (0.0%)   
                        3. Eastern Asia                    290 ( 0.9%)                                                      
                        4. Egypt                           290 ( 0.9%)                                                      
                        5. Europe                          290 ( 0.9%)                                                      
                        6. France                          290 ( 0.9%)                                                      
                        7. Greece                          290 ( 0.9%)                                                      
                        8. Myanmar                         290 ( 0.9%)                                                      
                        9. Northern Africa                 290 ( 0.9%)                                                      
                        10. South-eastern Asia             290 ( 0.9%)                                                      
                        [ 238 others ]                   28077 (90.6%)           IIIIIIIIIIIIIIIIII                         

5    Element Code       1 distinct value                 5112 : 30977 (100.0%)   IIIIIIIIIIIIIIIIIIII   30977      0        
     [numeric]                                                                                          (100.0%)   (0.0%)   

6    Element            1. Stocks                        30977 (100.0%)          IIIIIIIIIIIIIIIIIIII   30977      0        
     [character]                                                                                        (100.0%)   (0.0%)   

7    Item Code          Mean (sd) : 1066.5 (9)           1057 : 13074 (42.2%)    IIIIIIII               30977      0        
     [numeric]          min < med < max:                 1068 :  6909 (22.3%)    IIII                   (100.0%)   (0.0%)   
                        1057 < 1068 < 1083               1072 :  4136 (13.4%)    II                                         
                        IQR (CV) : 15 (0)                1079 :  5693 (18.4%)    III                                        
                                                         1083 :  1165 ( 3.8%)                                               

8    Item               1. Chickens                      13074 (42.2%)           IIIIIIII               30977      0        
     [character]        2. Ducks                          6909 (22.3%)           IIII                   (100.0%)   (0.0%)   
                        3. Geese and guinea fowls         4136 (13.4%)           II                                         
                        4. Pigeons, other birds           1165 ( 3.8%)                                                      
                        5. Turkeys                        5693 (18.4%)           III                                        

9    Year Code          Mean (sd) : 1990.6 (16.7)        58 distinct values      . . .   . :   : : :    30977      0        
     [numeric]          min < med < max:                                         : : : . : : : : : :    (100.0%)   (0.0%)   
                        1961 < 1992 < 2018                                       : : : : : : : : : :                        
                        IQR (CV) : 29 (0)                                        : : : : : : : : : :                        
                                                                                 : : : : : : : : : :                        

10   Year               Mean (sd) : 1990.6 (16.7)        58 distinct values      . . .   . :   : : :    30977      0        
     [numeric]          min < med < max:                                         : : : . : : : : : :    (100.0%)   (0.0%)   
                        1961 < 1992 < 2018                                       : : : : : : : : : :                        
                        IQR (CV) : 29 (0)                                        : : : : : : : : : :                        
                                                                                 : : : : : : : : : :                        

11   Unit               1. 1000 Head                     30977 (100.0%)          IIIIIIIIIIIIIIIIIIII   30977      0        
     [character]                                                                                        (100.0%)   (0.0%)   

12   Value              Mean (sd) : 99410.6 (720611.4)   11495 distinct values   :                      29941      1036     
     [numeric]          min < med < max:                                         :                      (96.7%)    (3.3%)   
                        0 < 1800 < 23707134                                      :                                          
                        IQR (CV) : 15233 (7.2)                                   :                                          
                                                                                 :                                          

13   Flag               1. *                              1494 ( 7.4%)           I                      20204      10773    
     [character]        2. A                              6488 (32.1%)           IIIIII                 (65.2%)    (34.8%)  
                        3. F                             10007 (49.5%)           IIIIIIIII                                  
                        4. Im                             1213 ( 6.0%)           I                                          
                        5. M                              1002 ( 5.0%)                                                      

14   Flag Description   1. Aggregate, may include of      6488 (20.9%)           IIII                   30977      0        
     [character]        2. Data not available             1002 ( 3.2%)                                  (100.0%)   (0.0%)   
                        3. FAO data based on imputat      1213 ( 3.9%)                                                      
                        4. FAO estimate                  10007 (32.3%)           IIIIII                                     
                        5. Official data                 10773 (34.8%)           IIIIII                                     
                        6. Unofficial figure              1494 ( 4.8%)                                                      
----------------------------------------------------------------------------------------------------------------------------
Code
view(dfSummary(birds))
arrange(birds, desc(Value))
# A tibble: 30,977 × 14
   Domain Cod…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
   <chr>        <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
 1 QA           Live …    5000 World    5112 Stocks     1057 Chic…    2018  2018
 2 QA           Live …    5000 World    5112 Stocks     1057 Chic…    2017  2017
 3 QA           Live …    5000 World    5112 Stocks     1057 Chic…    2016  2016
 4 QA           Live …    5000 World    5112 Stocks     1057 Chic…    2015  2015
 5 QA           Live …    5000 World    5112 Stocks     1057 Chic…    2014  2014
 6 QA           Live …    5000 World    5112 Stocks     1057 Chic…    2013  2013
 7 QA           Live …    5000 World    5112 Stocks     1057 Chic…    2012  2012
 8 QA           Live …    5000 World    5112 Stocks     1057 Chic…    2010  2010
 9 QA           Live …    5000 World    5112 Stocks     1057 Chic…    2011  2011
10 QA           Live …    5000 World    5112 Stocks     1057 Chic…    2009  2009
# … with 30,967 more rows, 4 more variables: Unit <chr>, Value <dbl>,
#   Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
#   ¹​`Domain Code`, ²​`Area Code`, ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`

Describe the data

The data set reports stocks of live birds in five categories (Chickens; Ducks; Geese and guinea fowls; Pigeons, other birds; Turkeys) for different geographic areas (countries, continents, the world) from 1961 to 2018. It contains 30977 rows, so we have 30977 cases. Each case is identified by the type of bird (‘Item’), the geographic area (‘Area’) and the year (‘Year’). The columns ‘Unit’ and ‘Value’ give the size of the stock: ‘Unit’ is always 1000 Head, and ‘Value’ ranges from 0 to 23707134 (the maximum is the world chicken stock in 2018), with 1036 missing values. The columns ‘Flag’ and ‘Flag Description’ describe how each value was obtained, for example official data, an FAO estimate, or an unofficial figure.
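The counts in the dfSummary output suggest that each ‘Flag’ code pairs with exactly one ‘Flag Description’ (for example, 10007 rows for both ‘F’ and ‘FAO estimate’, and 10773 missing flags against 10773 rows of ‘Official data’). A minimal sketch of how that pairing could be checked with count(), using a small made-up data frame in place of the real file (the rows and the shortened ‘Aggregate’ label are assumptions for illustration):

```r
library(dplyr)

# Toy rows standing in for birds.csv; the values are made up for illustration.
toy <- data.frame(
  Flag = c("F", "F", NA, "A", "*"),
  `Flag Description` = c("FAO estimate", "FAO estimate", "Official data",
                         "Aggregate", "Unofficial figure"),
  check.names = FALSE  # keep the space in the column name
)

# One row per distinct Flag / description pair, with the number of cases.
flag_key <- toy %>%
  count(Flag, `Flag Description`, name = "n_cases")

print(flag_key)
```

If a flag appeared with more than one description, it would show up as several rows for the same ‘Flag’ value.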

Code
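birds<-birds%>%
  # Keep non-aggregate rows: the listed flags plus unflagged (NA) rows, which
  # the summary above pairs with 'Official data'. %in% tests membership,
  # whereas == would recycle the vector and compare position by position.
  filter(Flag %in% c("M", "Im", "F", "*") | is.na(Flag))%>%
  filter(Item == 'Chickens')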

birds%>%
  group_by(Year)%>%
  summarise(
    Median = median(Value, na.rm = TRUE), Mean = mean(Value, na.rm = TRUE),
    SD = sd(Value, na.rm = TRUE),
    Min = min(Value, na.rm = TRUE), Max = max(Value, na.rm = TRUE)
  )
# A tibble: 58 × 6
    Year Median   Mean      SD   Min    Max
   <dbl>  <dbl>  <dbl>   <dbl> <dbl>  <dbl>
 1  1961   2000 31861. 120751.     2 530000
 2  1962   1525 13065.  32937.    35 170000
 3  1963   2650 39482. 161735.    10 780000
 4  1964   2500  8315.  22758.    42 106500
 5  1965   1095  3110    4125.     2  12000
 6  1966   2600 41534. 130255.     3 587000
 7  1967    970  8777.  26745.    27 138000
 8  1968   4100  7884.  10644.    10  44500
 9  1969   4250 58425. 208698.    75 912700
10  1970    575  3367.   5163.     2  16700
# … with 48 more rows
Code
birds%>%
  group_by(Year)%>%
  # default probs: 0%, 25%, 50%, 75%, 100% (five rows per year)
  summarise(Quantile = quantile(Value, na.rm = TRUE))
# A tibble: 290 × 2
# Groups:   Year [58]
    Year Quantile
   <dbl>    <dbl>
 1  1961       2 
 2  1961     182.
 3  1961    2000 
 4  1961    6400 
 5  1961  530000 
 6  1962      35 
 7  1962     165 
 8  1962    1525 
 9  1962    5150 
10  1962  170000 
# … with 280 more rows
Code
birds%>%
  group_by(Area)%>%
  summarise(
    Median = median(Value, na.rm = TRUE), Mean = mean(Value, na.rm = TRUE),
    SD = sd(Value, na.rm = TRUE),
    Min = min(Value, na.rm = TRUE), Max = max(Value, na.rm = TRUE)
  )
# A tibble: 181 × 6
   Area                Median    Mean      SD   Min    Max
   <chr>                <dbl>   <dbl>   <dbl> <dbl>  <dbl>
 1 Afghanistan           6300  6180     719.   5000   6800
 2 Albania               2800  2800      NA    2800   2800
 3 Algeria             101500 90265.  45927.   9100 135018
 4 American Samoa          39    42.8    15.9    25     80
 5 Angola                5700  5471.   1155.   3600   6940
 6 Antigua and Barbuda     90    92.8    34.6    50    150
 7 Argentina            82500 79125   21006.  38000 105000
 8 Armenia               3835  3835      49.5  3800   3870
 9 Aruba                   NA   NaN      NA     Inf   -Inf
10 Australia            21145 21145    1209.  20290  22000
# … with 171 more rows
Code
birds%>%
  group_by(Area)%>%
  # default probs: 0%, 25%, 50%, 75%, 100% (five rows per area)
  summarise(Quantile = quantile(Value, na.rm = TRUE))
# A tibble: 905 × 2
# Groups:   Area [181]
   Area        Quantile
   <chr>          <dbl>
 1 Afghanistan     5000
 2 Afghanistan     6100
 3 Afghanistan     6300
 4 Afghanistan     6700
 5 Afghanistan     6800
 6 Albania         2800
 7 Albania         2800
 8 Albania         2800
 9 Albania         2800
10 Albania         2800
# … with 895 more rows

Provide Grouped Summary Statistics

  • First, I filtered the data to look at the number of chickens in different areas from 1961 to 2018.
  • I also removed aggregated areas (rows flagged ‘A’, such as continents and the world) to keep only countries.
  • I grouped the data in two ways: by year and by area.
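Filtering on several flag values at once relies on membership testing. A minimal sketch on made-up flags (not the real data) of how `==` and `%in%` behave when a column is compared against a vector of values:

```r
# Made-up flags for six rows; NA stands for an empty flag.
flag <- c("F", "A", "M", "F", NA, "*")
keep <- c("M", "F", "*")

# `==` recycles `keep` along `flag` (M, F, *, M, F, *) and compares
# position by position, so most matching rows are missed.
by_equals <- flag == keep

# `%in%` asks, for each flag, whether it appears anywhere in `keep`.
by_in <- flag %in% keep

print(by_equals)
print(by_in)
```

Here `by_in` marks every "F", "M" and "*" row, while `by_equals` is TRUE only where a flag happens to line up with the same recycled position. Neither comparison keeps the NA row, so unflagged rows need an explicit `is.na(flag)` if they should stay.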

The first grouping is by year. The first table shows the median, mean, standard deviation, minimum and maximum of chicken stocks for each year, and the second table shows the quantiles. The third and fourth tables show the same measures of central tendency and dispersion for the data grouped by area.
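In the quantile tables, quantile() with its default probabilities returns five unlabeled values per group (0%, 25%, 50%, 75%, 100%). One possible way to make them easier to read is to request each quartile as its own named column. A sketch on a made-up data frame (the numbers are illustrative, not taken from the data set):

```r
library(dplyr)

# Toy values for two years; made up for illustration.
toy <- data.frame(
  Year  = c(1961, 1961, 1961, 1962, 1962, 1962),
  Value = c(10, 20, 30, 5, 15, NA)
)

# One row per year, with each quartile in a labeled column.
quartiles <- toy %>%
  group_by(Year) %>%
  summarise(
    Q1     = quantile(Value, 0.25, na.rm = TRUE, names = FALSE),
    Median = quantile(Value, 0.50, na.rm = TRUE, names = FALSE),
    Q3     = quantile(Value, 0.75, na.rm = TRUE, names = FALSE)
  )

print(quartiles)
```

This keeps one row per group, so the result can be read side by side with the mean and standard deviation table.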

Explain and Interpret

The summary statistics for the first grouping show how the central tendency changed over the observed years. Because only one type of bird is kept, we do not compare different birds but look at how chicken stocks changed over time. The summary for the second grouping shows how central tendency and dispersion differ among countries.
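Some rows in the by-area table need care when interpreting. Aruba shows NA, NaN, Inf and -Inf, which is what the summary functions return with na.rm = TRUE when every value in a group is missing, and Albania has an SD of NA because sd() is undefined for a single observation. A small base R sketch of that behaviour, with a hypothetical helper `safe_range()` (not part of any package) that returns NA instead of Inf and -Inf for an all-missing group:

```r
# `safe_range` is a hypothetical helper introduced only for this sketch.
safe_range <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) {
    return(c(min = NA_real_, max = NA_real_))
  }
  c(min = min(x), max = max(x))
}

all_missing <- c(NA_real_, NA_real_)

# With na.rm = TRUE nothing is left to compare, so min() returns Inf
# (and raises a warning, suppressed here).
print(suppressWarnings(min(all_missing, na.rm = TRUE)))

# sd() of a single observation is NA.
print(sd(2800))

print(safe_range(all_missing))
print(safe_range(c(25, NA, 80)))
```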
