DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Challenge 2

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Read in the Data
  • Describe the data
  • Provide Grouped Summary Statistics
    • Explain and Interpret

Challenge 2

  • Show All Code
  • Hide All Code

  • View Source
challenge_2
faostat
Author

Tracy Tien

Published

Invalid Date

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

  • FAOSTAT_livestock.csv

I will read in the data and do a series of “poking around” (commented out in the code below) to get a sense of the dataset, and return a more concise version of the data.

Code
FAOstat <- read_csv("_data/FAOSTAT_livestock.csv")
# I'm doing a series of "poking around" after reading in the data (commented out) to get a sense of the dataset.
# spec(FAOstat)
# dim(FAOstat) showed that FAOSTAT_livestock has 82116 records and 14 rows. 
# colnames(FAOstat)
head(FAOstat)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1961  1961 Head 
2 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1962  1962 Head 
3 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1963  1963 Head 
4 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1964  1964 Head 
5 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1965  1965 Head 
6 QA      Live …       2 Afgh…    5111 Stocks     1107 Asses    1966  1966 Head 
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
Code
#install.packages("dplyr")
# I want to keep the descriptive columns instead of their corresponding codes.
codevars <- names(FAOstat) %in% c("Domain Code", "Area Code", "Element Code", "Item Code", "Year Code")
FAOStat_NoCodes <- FAOstat[!codevars]
FAOStat_NoCodes
# A tibble: 82,116 × 9
   Domain       Area        Element Item   Year Unit    Value Flag  Flag Descr…¹
   <chr>        <chr>       <chr>   <chr> <dbl> <chr>   <dbl> <chr> <chr>       
 1 Live Animals Afghanistan Stocks  Asses  1961 Head  1300000 <NA>  Official da…
 2 Live Animals Afghanistan Stocks  Asses  1962 Head   851850 <NA>  Official da…
 3 Live Animals Afghanistan Stocks  Asses  1963 Head  1001112 <NA>  Official da…
 4 Live Animals Afghanistan Stocks  Asses  1964 Head  1150000 F     FAO estimate
 5 Live Animals Afghanistan Stocks  Asses  1965 Head  1300000 <NA>  Official da…
 6 Live Animals Afghanistan Stocks  Asses  1966 Head  1200000 <NA>  Official da…
 7 Live Animals Afghanistan Stocks  Asses  1967 Head  1200000 <NA>  Official da…
 8 Live Animals Afghanistan Stocks  Asses  1968 Head  1328000 <NA>  Official da…
 9 Live Animals Afghanistan Stocks  Asses  1969 Head  1250000 <NA>  Official da…
10 Live Animals Afghanistan Stocks  Asses  1970 Head  1300000 <NA>  Official da…
# … with 82,106 more rows, and abbreviated variable name ¹​`Flag Description`

Describe the data

The FAOSTAT_livestock data contains information for 253 United Nations Food & Agriculture Organization countries, on livestock animals such as asses (donkeys?), buffaloes, camels, cattle, goats, horses, mules, pigs, and sheep (measured in heads.) The temporal coverage is from 1961-2018. The data were likely reported by the country (based on the “Flag Description” field denoting as “offical data”) or estimated by the FAO (“FAO estimate”.)

Code
n_distinct(select(FAOStat_NoCodes, "Area"))
[1] 253
Code
# table(select(FAOStat_NoCodes,"Element"))
table(select(FAOStat_NoCodes, "Item"))
Item
    Asses Buffaloes    Camels    Cattle     Goats    Horses     Mules      Pigs 
     8571      3505      3265     13086     12498     11104      6153     12015 
    Sheep 
    11919 
Code
table(select(FAOStat_NoCodes, "Year"))
Year
1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 
1327 1327 1327 1328 1329 1341 1340 1340 1340 1340 1339 1340 1340 1338 1338 1336 
1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 
1336 1337 1338 1338 1338 1337 1337 1339 1344 1344 1344 1344 1343 1360 1361 1488 
1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 
1497 1497 1497 1497 1497 1497 1497 1502 1502 1502 1502 1502 1502 1506 1506 1506 
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 
1506 1506 1506 1513 1513 1513 1513 1513 1513 1513 

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

I first grouped the data by Item (which denotes what type of livestock) and found the average for each, then I sorted them in descending order (using tally and sort). I was able to show for each country and each item, then it occurred to me that it would be interesting to just see the top producing livestock for each country. This then allows us to see the difference between the overall trend for all FAO countries, and how countries differ from that (even by how much.) Future questions might be to compare by year.

Code
# make a cleaner data to make a summary table, this can also be done with select()
codevars1 <- names(FAOstat) %in% c("Domain", "Domain Code","Area Code", "Element", "Element Code", "Item Code", "Unit", "Year Code", "Flag", "Flag Description")
FAOStat_cleaned <- FAOstat[!codevars1]
FAOStat_cleaned
# A tibble: 82,116 × 4
   Area        Item   Year   Value
   <chr>       <chr> <dbl>   <dbl>
 1 Afghanistan Asses  1961 1300000
 2 Afghanistan Asses  1962  851850
 3 Afghanistan Asses  1963 1001112
 4 Afghanistan Asses  1964 1150000
 5 Afghanistan Asses  1965 1300000
 6 Afghanistan Asses  1966 1200000
 7 Afghanistan Asses  1967 1200000
 8 Afghanistan Asses  1968 1328000
 9 Afghanistan Asses  1969 1250000
10 Afghanistan Asses  1970 1300000
# … with 82,106 more rows
Code
# summary of cleaner data
library(summarytools)
print(dfSummary(FAOStat_cleaned))
Data Frame Summary  
FAOStat_cleaned  
Dimensions: 82116 x 4  
Duplicates: 0  

-----------------------------------------------------------------------------------------------------------------------
No   Variable      Stats / Values                    Freqs (% of Valid)      Graph                 Valid      Missing  
---- ------------- --------------------------------- ----------------------- --------------------- ---------- ---------
1    Area          1. Africa                           522 ( 0.6%)                                 82116      0        
     [character]   2. Asia                             522 ( 0.6%)                                 (100.0%)   (0.0%)   
                   3. China, mainland                  522 ( 0.6%)                                                     
                   4. Eastern Africa                   522 ( 0.6%)                                                     
                   5. Eastern Asia                     522 ( 0.6%)                                                     
                   6. Eastern Europe                   522 ( 0.6%)                                                     
                   7. Egypt                            522 ( 0.6%)                                                     
                   8. Europe                           522 ( 0.6%)                                                     
                   9. India                            522 ( 0.6%)                                                     
                   10. Northern Africa                 522 ( 0.6%)                                                     
                   [ 243 others ]                    76896 (93.6%)           IIIIIIIIIIIIIIIIII                        

2    Item          1. Asses                           8571 (10.4%)           II                    82116      0        
     [character]   2. Buffaloes                       3505 ( 4.3%)                                 (100.0%)   (0.0%)   
                   3. Camels                          3265 ( 4.0%)                                                     
                   4. Cattle                         13086 (15.9%)           III                                       
                   5. Goats                          12498 (15.2%)           III                                       
                   6. Horses                         11104 (13.5%)           II                                        
                   7. Mules                           6153 ( 7.5%)           I                                         
                   8. Pigs                           12015 (14.6%)           II                                        
                   9. Sheep                          11919 (14.5%)           II                                        

3    Year          Mean (sd) : 1990.4 (16.8)         58 distinct values      . . .   . :   : : :   82116      0        
     [numeric]     min < med < max:                                          : : : . : : : : : :   (100.0%)   (0.0%)   
                   1961 < 1991 < 2018                                        : : : : : : : : : :                       
                   IQR (CV) : 29 (0)                                         : : : : : : : : : :                       
                                                                             : : : : : : : : : :                       

4    Value         Mean (sd) : 11625569 (64779790)   43667 distinct values   :                     80815      1301     
     [numeric]     min < med < max:                                          :                     (98.4%)    (1.6%)   
                   0 < 224667 < 1489744504                                   :                                         
                   IQR (CV) : 2364200 (5.6)                                  :                                         
                                                                             :                                         
-----------------------------------------------------------------------------------------------------------------------
Code
# group by the type of livestock (= Item)
by_item <- FAOStat_cleaned %>% group_by(Item)

#by_item %>% tally(sort = TRUE) 
#by_item %>%
  #summarize(avg_value = mean(Value, na.rm=TRUE)) 


#For each country, group by item, and show the top average value livestock for each country.
item_by_area <- FAOStat_cleaned %>%
  group_by(Area, Item) %>%
  summarize(mean_value = mean(Value, na.rm =TRUE),
            median_value = median(Value, na.rm = TRUE)) %>%
  top_n(1)

item_by_area
# A tibble: 256 × 4
# Groups:   Area [253]
   Area                Item   mean_value median_value
   <chr>               <chr>       <dbl>        <dbl>
 1 Afghanistan         Sheep   15397675.    14185000 
 2 Africa              Sheep  226054195.   205408754.
 3 Albania             Sheep    1623645.     1664250 
 4 Algeria             Sheep   15573059.    16974215 
 5 American Samoa      Pigs       10659.       10500 
 6 Americas            Cattle 431297713.   432220557 
 7 Angola              Cattle   3339982.     3200000 
 8 Antigua and Barbuda Cattle     11211.       13327 
 9 Argentina           Cattle  52116560.    52569504 
10 Armenia             Sheep     612495.      580178 
# … with 246 more rows

Explain and Interpret

Please see above section.

Source Code
---
title: "Challenge 2"
author: "Tracy Tien"
desription: "Data wrangling: using group() and summarise()"
date: "What is time these days?"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - faostat
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

## Challenge Overview

Today's challenge is to

1)  read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2)  provide summary statistics for different interesting groups within the data, and interpret those statistics

## Read in the Data

-   FAOSTAT_livestock.csv

I will read in the data and do a series of "poking around" (commented out in the code below) to get a sense of the dataset, and return a more concise version of the data.

```{r}
FAOstat <- read_csv("_data/FAOSTAT_livestock.csv")
# I'm doing a series of "poking around" after reading in the data (commented out) to get a sense of the dataset.
# spec(FAOstat)
# dim(FAOstat) showed that FAOSTAT_livestock has 82116 records and 14 rows. 
# colnames(FAOstat)
head(FAOstat)
#install.packages("dplyr")
# I want to keep the descriptive columns instead of their corresponding codes.
codevars <- names(FAOstat) %in% c("Domain Code", "Area Code", "Element Code", "Item Code", "Year Code")
FAOStat_NoCodes <- FAOstat[!codevars]
FAOStat_NoCodes
```

## Describe the data

The FAOSTAT_livestock data contains information for 253 United Nations Food & Agriculture Organization countries, on livestock animals such as asses (donkeys?), buffaloes, camels, cattle, goats, horses, mules, pigs, and sheep (measured in heads.) The temporal coverage is from 1961-2018. The data were likely reported by the country (based on the "Flag Description" field denoting as "offical data") or estimated by the FAO ("FAO estimate".)


```{r}
#| label: summary
n_distinct(select(FAOStat_NoCodes, "Area"))
# table(select(FAOStat_NoCodes,"Element"))
table(select(FAOStat_NoCodes, "Item"))
table(select(FAOStat_NoCodes, "Year"))
```

## Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.

I first grouped the data by Item (which denotes what type of livestock) and found the average for each, then I sorted them in descending order (using tally and sort). I was able to show for each country and each item, then it occurred to me that it would be interesting to just see the top producing livestock for each country. This then allows us to see the difference between the overall trend for all FAO countries, and how countries differ from that (even by how much.) Future questions might be to compare by year.

```{r}
# make a cleaner data to make a summary table, this can also be done with select()
codevars1 <- names(FAOstat) %in% c("Domain", "Domain Code","Area Code", "Element", "Element Code", "Item Code", "Unit", "Year Code", "Flag", "Flag Description")
FAOStat_cleaned <- FAOstat[!codevars1]
FAOStat_cleaned

# summary of cleaner data
library(summarytools)
print(dfSummary(FAOStat_cleaned))

# group by the type of livestock (= Item)
by_item <- FAOStat_cleaned %>% group_by(Item)

#by_item %>% tally(sort = TRUE) 
#by_item %>%
  #summarize(avg_value = mean(Value, na.rm=TRUE)) 


#For each country, group by item, and show the top average value livestock for each country.
item_by_area <- FAOStat_cleaned %>%
  group_by(Area, Item) %>%
  summarize(mean_value = mean(Value, na.rm =TRUE),
            median_value = median(Value, na.rm = TRUE)) %>%
  top_n(1)

item_by_area


```

### Explain and Interpret

Please see above section.