Challenge 2 Instructions

challenge_2
FAO
Author

Meredith Rolfe

Published

August 16, 2022

Code
library(tidyverse)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in Data

Code
FAO <- read_csv("_data/FAOSTAT_livestock.csv")
FAO
# A tibble: 82,116 × 14
   Domain Cod…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
   <chr>        <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
 1 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1961  1961
 2 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1962  1962
 3 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1963  1963
 4 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1964  1964
 5 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1965  1965
 6 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1966  1966
 7 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1967  1967
 8 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1968  1968
 9 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1969  1969
10 QA           Live …       2 Afgh…    5111 Stocks     1107 Asses    1970  1970
# … with 82,106 more rows, 4 more variables: Unit <chr>, Value <dbl>,
#   Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
#   ¹​`Domain Code`, ²​`Area Code`, ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

This dataset comes from the Food and Agriculture Organization of the United Nations (FAO). The FAO publishes country-level data regularly, and I am going to look at its estimates of the number of animals raised as livestock. The livestock file has 82,116 rows and 14 columns.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

Code
spec(FAO)
cols(
  `Domain Code` = col_character(),
  Domain = col_character(),
  `Area Code` = col_double(),
  Area = col_character(),
  `Element Code` = col_double(),
  Element = col_character(),
  `Item Code` = col_double(),
  Item = col_character(),
  `Year Code` = col_double(),
  Year = col_double(),
  Unit = col_character(),
  Value = col_double(),
  Flag = col_character(),
  `Flag Description` = col_character()
)
Code
FAO.sm <- FAO %>%
  select(-contains("Code"))
FAO.sm
# A tibble: 82,116 × 9
   Domain       Area        Element Item   Year Unit    Value Flag  Flag Descr…¹
   <chr>        <chr>       <chr>   <chr> <dbl> <chr>   <dbl> <chr> <chr>       
 1 Live Animals Afghanistan Stocks  Asses  1961 Head  1300000 <NA>  Official da…
 2 Live Animals Afghanistan Stocks  Asses  1962 Head   851850 <NA>  Official da…
 3 Live Animals Afghanistan Stocks  Asses  1963 Head  1001112 <NA>  Official da…
 4 Live Animals Afghanistan Stocks  Asses  1964 Head  1150000 F     FAO estimate
 5 Live Animals Afghanistan Stocks  Asses  1965 Head  1300000 <NA>  Official da…
 6 Live Animals Afghanistan Stocks  Asses  1966 Head  1200000 <NA>  Official da…
 7 Live Animals Afghanistan Stocks  Asses  1967 Head  1200000 <NA>  Official da…
 8 Live Animals Afghanistan Stocks  Asses  1968 Head  1328000 <NA>  Official da…
 9 Live Animals Afghanistan Stocks  Asses  1969 Head  1250000 <NA>  Official da…
10 Live Animals Afghanistan Stocks  Asses  1970 Head  1300000 <NA>  Official da…
# … with 82,106 more rows, and abbreviated variable name ¹​`Flag Description`
# ℹ Use `print(n = ...)` to see more rows
Code
print(dfSummary(FAO.sm, varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

FAO.sm

Dimensions: 82116 x 9
Duplicates: 0

Variable                      Stats / Values                    Freqs (% of Valid)   Missing
----------------------------  --------------------------------  -------------------  --------------
Domain [character]            1. Live Animals                   82116 (100.0%)       0 (0.0%)

Area [character]              1. Africa                         522 (0.6%)           0 (0.0%)
                              2. Asia                           522 (0.6%)
                              3. China, mainland                522 (0.6%)
                              4. Eastern Africa                 522 (0.6%)
                              5. Eastern Asia                   522 (0.6%)
                              6. Eastern Europe                 522 (0.6%)
                              7. Egypt                          522 (0.6%)
                              8. Europe                         522 (0.6%)
                              9. India                          522 (0.6%)
                              10. Northern Africa               522 (0.6%)
                              [ 243 others ]                    76896 (93.6%)

Element [character]           1. Stocks                         82116 (100.0%)       0 (0.0%)

Item [character]              1. Asses                          8571 (10.4%)         0 (0.0%)
                              2. Buffaloes                      3505 (4.3%)
                              3. Camels                         3265 (4.0%)
                              4. Cattle                         13086 (15.9%)
                              5. Goats                          12498 (15.2%)
                              6. Horses                         11104 (13.5%)
                              7. Mules                          6153 (7.5%)
                              8. Pigs                           12015 (14.6%)
                              9. Sheep                          11919 (14.5%)

Year [numeric]                Mean (sd) : 1990.4 (16.8)         58 distinct values   0 (0.0%)
                              min ≤ med ≤ max:
                              1961 ≤ 1991 ≤ 2018
                              IQR (CV) : 29 (0)

Unit [character]              1. Head                           82116 (100.0%)       0 (0.0%)

Value [numeric]               Mean (sd) : 11625569 (64779790)   43667 distinct       1301 (1.6%)
                              min ≤ med ≤ max:                  values
                              0 ≤ 224667 ≤ 1489744504
                              IQR (CV) : 2364200 (5.6)

Flag [character]              1. *                              2667 (6.1%)          38270 (46.6%)
                              2. A                              12567 (28.7%)
                              3. F                              24550 (56.0%)
                              4. Im                             2877 (6.6%)
                              5. M                              1185 (2.7%)

Flag Description [character]  1. Aggregate, may include of      12567 (15.3%)        0 (0.0%)
                              2. Data not available             1185 (1.4%)
                              3. FAO data based on imputat      2877 (3.5%)
                              4. FAO estimate                   24550 (29.9%)
                              5. Official data                  38270 (46.6%)
                              6. Unofficial figure              2667 (3.2%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-09-01

Based on the output of spec(), six variables are type double and eight are type character. Among the double variables, Area Code, Item Code and Year are identifiers rather than measurements: each takes a limited set of values that repeat across many rows, which makes them natural grouping variables. I dropped the five columns whose names contain "Code" (four double columns plus the character column Domain Code) because they are codes used for database management and duplicate the information in the matching label columns.

From dfSummary(), every record belongs to the Live Animals domain, the only Element is Stocks, and the only Unit is Head. Each case is therefore one area-item-year record, and Value is the estimated number of live animals for that record (1,301 values, or 1.6%, are missing). In total, there are estimates of the stock of nine types of livestock (Asses, Buffaloes, Camels, Cattle, Goats, Horses, Mules, Pigs, Sheep) in 253 areas over 58 years (1961 to 2018). Flag and Flag Description indicate the source of each value, such as official data, an FAO estimate, or an aggregate; a missing Flag corresponds to official data.
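The counts quoted above (areas, items, years) and the claim that each case is one area-item-year record can be checked directly. The following is a minimal sketch on a small made-up tibble, not the FAO file itself: `toy` and its values are invented for illustration, and only the column names mirror FAO.sm.

```r
library(dplyr)
library(tibble)

# Toy data with the same column names as FAO.sm (all values are invented)
toy <- tribble(
  ~Area,         ~Item,    ~Year, ~Value,
  "Afghanistan", "Asses",   1961, 1300000,
  "Afghanistan", "Asses",   1962,  851850,
  "Afghanistan", "Cattle",  1961, 2900000,
  "Albania",     "Cattle",  1961,  420000,
  "Albania",     "Cattle",  1962,      NA
)

# Number of distinct areas, items and years
distinct_counts <- toy %>%
  summarise(across(c(Area, Item, Year), n_distinct))

# Largest number of rows sharing one Area-Item-Year combination;
# a value of 1 means that combination uniquely identifies a case
max_per_case <- toy %>%
  count(Area, Item, Year) %>%
  pull(n) %>%
  max()

distinct_counts
max_per_case
```

Running the same two steps on FAO.sm should reproduce the 253 areas, 9 items and 58 years reported by dfSummary().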

Provide Grouped Summary Statistics

Code
FAO.sm %>%
  filter(Flag=="A")%>%
  group_by(Area)%>%
  summarize(n=n())
# A tibble: 28 × 2
   Area                          n
   <chr>                     <int>
 1 Africa                      522
 2 Americas                    464
 3 Asia                        522
 4 Australia and New Zealand   376
 5 Caribbean                   464
 6 Central America             406
 7 Central Asia                243
 8 Eastern Africa              522
 9 Eastern Asia                522
10 Eastern Europe              522
# … with 18 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
FAO_clc <- FAO.sm %>%
  filter(Flag!="A")

FAO_clc  
# A tibble: 31,279 × 9
   Domain       Area        Element Item   Year Unit    Value Flag  Flag Descr…¹
   <chr>        <chr>       <chr>   <chr> <dbl> <chr>   <dbl> <chr> <chr>       
 1 Live Animals Afghanistan Stocks  Asses  1964 Head  1150000 F     FAO estimate
 2 Live Animals Afghanistan Stocks  Asses  1973 Head  1250000 F     FAO estimate
 3 Live Animals Afghanistan Stocks  Asses  1974 Head  1250000 F     FAO estimate
 4 Live Animals Afghanistan Stocks  Asses  1975 Head  1250000 F     FAO estimate
 5 Live Animals Afghanistan Stocks  Asses  1976 Head  1250000 F     FAO estimate
 6 Live Animals Afghanistan Stocks  Asses  1978 Head  1300000 *     Unofficial …
 7 Live Animals Afghanistan Stocks  Asses  1979 Head  1300000 *     Unofficial …
 8 Live Animals Afghanistan Stocks  Asses  1980 Head  1295000 *     Unofficial …
 9 Live Animals Afghanistan Stocks  Asses  1981 Head  1315000 *     Unofficial …
10 Live Animals Afghanistan Stocks  Asses  1982 Head  1315000 *     Unofficial …
# … with 31,269 more rows, and abbreviated variable name ¹​`Flag Description`
# ℹ Use `print(n = ...)` to see more rows
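
The row count above (31,279) equals the sum of the non-missing, non-A flag counts in the summary table (2,667 + 24,550 + 2,877 + 1,185), which suggests that `filter(Flag != "A")` also dropped the rows whose Flag is NA. A minimal sketch of that behavior on a made-up tibble (`toy` and its values are invented for illustration) and one possible way to keep the NA rows:

```r
library(dplyr)
library(tibble)

# Toy data: one aggregate row, one row with a missing flag, two flagged rows
toy <- tribble(
  ~Area,         ~Value, ~Flag,
  "Africa",         100, "A",
  "Afghanistan",     10, NA,
  "Afghanistan",     12, "F",
  "Albania",          5, "*"
)

# Flag != "A" evaluates to NA when Flag is NA, and filter() drops
# rows whose condition is NA, so the missing-flag row disappears too
dropped_na <- toy %>% filter(Flag != "A")

# Keeping rows with a missing flag requires saying so explicitly
kept_na <- toy %>% filter(is.na(Flag) | Flag != "A")

nrow(dropped_na)
nrow(kept_na)
```

Whether the missing-flag rows should be kept depends on the analysis; the sketch only shows that the two filters are not equivalent.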
Code
FAO_clc %>%
  group_by(Item) %>%
  # n() is the number of rows per Item (a count, not a statistical mode)
  summarize(avg = mean(Value, na.rm = TRUE),
            n = n(),
            median = median(Value, na.rm = TRUE),
            stdev = sd(Value, na.rm = TRUE),
            min = min(Value, na.rm = TRUE),
            max = max(Value, na.rm = TRUE))
# A tibble: 9 × 7
  Item           avg     n median     stdev   min       max
  <chr>        <dbl> <int>  <dbl>     <dbl> <dbl>     <dbl>
1 Asses      196051.  4899  14300   615866.     0   8793747
2 Buffaloes 5901247.   756   6550 19546207.    20 114151770
3 Camels     499737.  1075  85350  1311815.    45   7762545
4 Cattle    4380953.  3554  47650 21361967.    15 203634000
5 Goats     2577844.  4711  57625 11790005.     0 139467008
6 Horses     276368.  5046  11500   999297.     0  10479246
7 Mules       80414.  3357   4400   357196.     0   3287449
8 Pigs       746710.  4107  29000  9429158.     0 345754816
9 Sheep     2463044.  3774  40000  8206951.     0 111238000

Explain and Interpret

Here we can confirm that not all cases are countries. Flag value A marks Areas that are actually regional aggregates (Africa, Asia, Eastern Europe and so on), and these should be filtered out to keep every case at the country level. The first filter keeps only the Flag A rows and counts them by Area: many aggregates have the full 522 rows (9 items × 58 years), but others have fewer, such as Central Asia (243) and Australia and New Zealand (376), and Melanesia and Micronesia also stand out in the full output. The second filter removes all Flag A rows to create FAO_clc, a country-level version of the dataset. One caveat: because Flag != "A" evaluates to NA when Flag is missing, this filter also drops the 38,270 official-data rows, which is why FAO_clc has 31,279 rows rather than 69,549.

I then ran an exploratory summary of FAO_clc grouped by Item. My first impression was how vastly the mean and median differ for every Item: for Cattle, the mean is about 4.38 million head while the median is 47,650. A mean that far above the median implies the data are strongly right-skewed, with a few very large national herds pulling the average up. The standard deviation for each Item is also several times its mean, which indicates that the values are very spread out. The min and max values add little beyond confirming that range.
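
The link between a large gap between mean and median and right skew can be illustrated with a small made-up example. This is a sketch only: `toy`, its values, and the `mean_to_median` column are invented for illustration and are not part of the original analysis.

```r
library(dplyr)
library(tibble)

# Two invented groups: "Cattle" has one extreme value, "Goats" has none
toy <- tibble(
  Item  = c(rep("Cattle", 5), rep("Goats", 5)),
  Value = c(10, 20, 30, 40, 5000,
            10, 20, 30, 40, 50)
)

# A mean-to-median ratio far above 1 points to a long right tail
skew_check <- toy %>%
  group_by(Item) %>%
  summarize(avg = mean(Value),
            med = median(Value),
            mean_to_median = avg / med)

skew_check
```

Applied to FAO_clc, the same ratio would be well above 1 for every Item, consistent with the summary table above.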