Challenge 2

challenge_2

FNU Avinesh Krishnan

FAOSTAT_livestock

Data wrangling: using group() and summarise()

Author

FNU Avinesh Krishnan

Published

May 15, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

railroad*.csv or StateCounty2012.xls ⭐
FAOstat*.csv or birds.csv ⭐⭐⭐
hotel_bookings.csv ⭐⭐⭐⭐

Code

livestockdata <-read_csv("~/Desktop/601_Spring_2023/posts/_data/FAOSTAT_livestock.csv")
view(livestockdata)

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

The dataset consists of information related to live animals in different countries. It includes the following cases and variables:

Each row represents a specific case, providing details about live animal stocks in a particular country for a specific year. Variables: Domain Code: Code representing the domain of the data (e.g., QA for live animals). Domain: Domain of the data (e.g., Live Animals). Area Code: Code representing the specific area or country (e.g., 2 for Afghanistan). Area: Name of the area or country (e.g., Afghanistan). Element Code: Code representing the element of the data (e.g., 5111 for stocks). Element: Description of the element (e.g., Stocks). Item Code: Code representing the specific item or category (e.g., 1107 for asses). Item: Description of the item (e.g., Asses). Year Code: Code representing the year of the data (e.g., 1961). Year: Year of the data (e.g., 1961). Unit: Unit of measurement (e.g., Head). Value: Numeric value representing the quantity of live animals. Flag: Flag indicating additional information or data status (e.g., NA for missing data). Flag Description: Description of the flag (e.g., Official data or FAO estimate). The data appears to be gathered from official sources, likely collected by the Food and Agriculture Organization (FAO) or a related organization. It provides insights into the stock levels of various live animals in different countries over multiple years.

Code

head(livestockdata)

# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1961  1961 Head 
2 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1962  1962 Head 
3 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1963  1963 Head 
4 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1964  1964 Head 
5 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1965  1965 Head 
6 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1966  1966 Head 
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
#   ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`

Code

str(livestockdata)

spc_tbl_ [82,116 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Domain Code     : chr [1:82116] "QA" "QA" "QA" "QA" ...
 $ Domain          : chr [1:82116] "Live Animals" "Live Animals" "Live Animals" "Live Animals" ...
 $ Area Code       : num [1:82116] 2 2 2 2 2 2 2 2 2 2 ...
 $ Area            : chr [1:82116] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ Element Code    : num [1:82116] 5111 5111 5111 5111 5111 ...
 $ Element         : chr [1:82116] "Stocks" "Stocks" "Stocks" "Stocks" ...
 $ Item Code       : num [1:82116] 1107 1107 1107 1107 1107 ...
 $ Item            : chr [1:82116] "Asses" "Asses" "Asses" "Asses" ...
 $ Year Code       : num [1:82116] 1961 1962 1963 1964 1965 ...
 $ Year            : num [1:82116] 1961 1962 1963 1964 1965 ...
 $ Unit            : chr [1:82116] "Head" "Head" "Head" "Head" ...
 $ Value           : num [1:82116] 1300000 851850 1001112 1150000 1300000 ...
 $ Flag            : chr [1:82116] NA NA NA "F" ...
 $ Flag Description: chr [1:82116] "Official data" "Official data" "Official data" "FAO estimate" ...
 - attr(*, "spec")=
  .. cols(
  ..   `Domain Code` = col_character(),
  ..   Domain = col_character(),
  ..   `Area Code` = col_double(),
  ..   Area = col_character(),
  ..   `Element Code` = col_double(),
  ..   Element = col_character(),
  ..   `Item Code` = col_double(),
  ..   Item = col_character(),
  ..   `Year Code` = col_double(),
  ..   Year = col_double(),
  ..   Unit = col_character(),
  ..   Value = col_double(),
  ..   Flag = col_character(),
  ..   `Flag Description` = col_character()
  .. )
 - attr(*, "problems")=<externalptr>

Code

colnames(livestockdata)

 [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
 [5] "Element Code"     "Element"          "Item Code"        "Item"            
 [9] "Year Code"        "Year"             "Unit"             "Value"           
[13] "Flag"             "Flag Description"

Code

 livestockdata%>%
  select("Domain") %>%
  distinct(.)

# A tibble: 1 × 1
  Domain      
  <chr>       
1 Live Animals

Code

 livestockdata%>%
  select("Element Code") %>%
  distinct(.)

# A tibble: 1 × 1
  `Element Code`
           <dbl>
1           5111

Code

dim(livestockdata)

[1] 82116    14

Code

min(livestockdata$Year)

[1] 1961

Code

max(livestockdata$Year)

[1] 2018

Code

min(livestockdata$`Area Code`)

[1] 1

Code

max(livestockdata$`Area Code`)

[1] 5504

Code

summary(livestockdata)

 Domain Code           Domain            Area Code          Area          
 Length:82116       Length:82116       Min.   :   1.0   Length:82116      
 Class :character   Class :character   1st Qu.:  73.0   Class :character  
 Mode  :character   Mode  :character   Median : 146.0   Mode  :character  
                                       Mean   : 912.7                     
                                       3rd Qu.: 221.0                     
                                       Max.   :5504.0                     
                                                                          
  Element Code    Element            Item Code        Item          
 Min.   :5111   Length:82116       Min.   : 866   Length:82116      
 1st Qu.:5111   Class :character   1st Qu.: 976   Class :character  
 Median :5111   Mode  :character   Median :1034   Mode  :character  
 Mean   :5111                      Mean   :1018                     
 3rd Qu.:5111                      3rd Qu.:1096                     
 Max.   :5111                      Max.   :1126                     
                                                                    
   Year Code         Year          Unit               Value          
 Min.   :1961   Min.   :1961   Length:82116       Min.   :0.000e+00  
 1st Qu.:1976   1st Qu.:1976   Class :character   1st Qu.:1.250e+04  
 Median :1991   Median :1991   Mode  :character   Median :2.247e+05  
 Mean   :1990   Mean   :1990                      Mean   :1.163e+07  
 3rd Qu.:2005   3rd Qu.:2005                      3rd Qu.:2.377e+06  
 Max.   :2018   Max.   :2018                      Max.   :1.490e+09  
                                                  NA's   :1301       
     Flag           Flag Description  
 Length:82116       Length:82116      
 Class :character   Class :character  
 Mode  :character   Mode  :character

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

Code

livestockdata %>% 
  group_by(Area) %>% 
  summarise(averge_value = mean(Value), Max_value=max(Value), Min_value=min(Value))

# A tibble: 253 × 4
   Area                averge_value Max_value Min_value
   <chr>                      <dbl>     <dbl>     <dbl>
 1 Afghanistan             3597216.  21500000     20000
 2 Africa                 78159910. 438110974    706338
 3 Albania                      NA         NA        NA
 4 Algeria                 2575444.  28693330      2000
 5 American Samoa             5399.     14600        88
 6 Americas               95795716. 522867113     66000
 7 Angola                  1123117.   5044019       800
 8 Antigua and Barbuda        7877.     36000       400
 9 Argentina              12879607.  61053808     85198
10 Armenia                  167587.   1000000        25
# … with 243 more rows

Code

livestockdata %>% 
  filter(Area == "Africa" | Area == "Afghanistan",Item == "Camels", Year == 1961)

# A tibble: 2 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5111 Stocks     1126 Came…    1961  1961 Head 
2 QA      Live …    5100 Afri…    5111 Stocks     1126 Came…    1961  1961 Head 
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
#   ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`

There are two data points available for the above filter i.e for two areas.

Code

Summary_Year <- na.omit(livestockdata) %>%
  group_by(Year) %>%
  summarize(
    mean_Value = mean(Value, na.rm = TRUE),
    median_Value = median(Value, na.rm = TRUE),
    min_Value = min(Value, na.rm = TRUE),
    max_Value = max(Value, na.rm = TRUE),
    sd_Value = sd(Value, na.rm = TRUE),
    var_Value = var(Value, na.rm = TRUE),
    IQR_Value = IQR(Value, na.rm = TRUE)
  )
Summary_Year

# A tibble: 58 × 8
    Year mean_Value median_Value min_Value  max_Value  sd_Value var_Va…¹ IQR_V…²
   <dbl>      <dbl>        <dbl>     <dbl>      <dbl>     <dbl>    <dbl>   <dbl>
 1  1961  14080955.      129000         10  994268736 66803024.  4.46e15  2.14e6
 2  1962  14085663.      130000         10  997193122 66008462.  4.36e15  2.01e6
 3  1963  14568714.      154000         10  999696719 67268611.  4.53e15  2.49e6
 4  1964  15101322.      160000         10 1013486149 68974136.  4.76e15  2.61e6
 5  1965  15558689.      140000         10 1030878735 70820667.  5.02e15  2.75e6
 6  1966  15481029.      155000         10 1040688491 72204592.  5.21e15  2.42e6
 7  1967  15804831.      148000         10 1059154631 72456707.  5.25e15  2.54e6
 8  1968  15526705.      105850         10 1074022618 72343610.  5.23e15  2.25e6
 9  1969  15203756.      116412.        12 1080720926 71629580.  5.13e15  2.15e6
10  1970  15475993.      103000         12 1081641464 72217288.  5.22e15  2.24e6
# … with 48 more rows, and abbreviated variable names ¹var_Value, ²IQR_Value

Code

Summary_Area <- na.omit(livestockdata) %>%
  group_by(Area) %>%
  summarize(
    mean_Value = mean(Value, na.rm = TRUE),
    median_Value = median(Value, na.rm = TRUE),
    min_Value = min(Value, na.rm = TRUE),
    max_Value = max(Value, na.rm = TRUE),
    sd_Value = sd(Value, na.rm = TRUE),
    var_Value = var(Value, na.rm = TRUE),
    IQR_Value = IQR(Value, na.rm = TRUE)
  )
Summary_Area

# A tibble: 240 × 8
   Area                mean_Va…¹ media…² min_V…³ max_V…⁴ sd_Va…⁵ var_V…⁶ IQR_V…⁷
   <chr>                   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Afghanistan          2571604.  4.1 e5   20000  1.96e7  4.69e6 2.20e13  2.46e6
 2 Africa              78159910.  1.45e7  706338  4.38e8  1.08e8 1.17e16  1.49e8
 3 Albania               181722.  5.5 e4      93  1.35e6  3.24e5 1.05e11  9.35e4
 4 Algeria                22361.  5   e3    2000  7   e5  9.26e4 8.58e 9  1.5 e3
 5 American Samoa          5858.  9.6 e3      95  1.46e4  5.35e3 2.86e 7  1.04e4
 6 Americas            95795716.  3.22e7   66000  5.23e8  1.39e8 1.93e16  1.26e8
 7 Angola                867412.  2.4 e5     800  5.04e6  1.28e6 1.63e12  1.17e6
 8 Antigua and Barbuda     7426.  3   e3     400  3.6 e4  8.42e3 7.09e 7  1.1 e4
 9 Argentina            3685037.  2.4 e6   88000  4.9 e7  8.14e6 6.63e13  3.52e6
10 Armenia               330230.  3.83e4    6415  1   e6  4.11e5 1.69e11  5.95e5
# … with 230 more rows, and abbreviated variable names ¹mean_Value,
#   ²median_Value, ³min_Value, ⁴max_Value, ⁵sd_Value, ⁶var_Value, ⁷IQR_Value

The data spans a period from 1961 to 2018, indicating a long-term collection of data. The Element column contains only one distinct value, which is “Stocks”. Similarly, the Element Code also has a single value, which is 5111. This suggests that the dataset focuses specifically on the stocks element. The Area Code ranges from 1 to 5504, indicating a wide range of areas or regions included in the dataset. The Value column exhibits a significant amount of variance, suggesting diverse numerical values for the specific element being measured.

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.