Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Tracy Tien
Invalid Date
Today’s challenge is to
I will read in the data and do a series of “poking around” (commented out in the code below) to get a sense of the dataset, and return a more concise version of the data.
# A tibble: 6 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1961 1961 Head
2 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1962 1962 Head
3 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1963 1963 Head
4 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1964 1964 Head
5 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1965 1965 Head
6 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1966 1966 Head
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
# A tibble: 82,116 × 9
Domain Area Element Item Year Unit Value Flag Flag Descr…¹
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
1 Live Animals Afghanistan Stocks Asses 1961 Head 1300000 <NA> Official da…
2 Live Animals Afghanistan Stocks Asses 1962 Head 851850 <NA> Official da…
3 Live Animals Afghanistan Stocks Asses 1963 Head 1001112 <NA> Official da…
4 Live Animals Afghanistan Stocks Asses 1964 Head 1150000 F FAO estimate
5 Live Animals Afghanistan Stocks Asses 1965 Head 1300000 <NA> Official da…
6 Live Animals Afghanistan Stocks Asses 1966 Head 1200000 <NA> Official da…
7 Live Animals Afghanistan Stocks Asses 1967 Head 1200000 <NA> Official da…
8 Live Animals Afghanistan Stocks Asses 1968 Head 1328000 <NA> Official da…
9 Live Animals Afghanistan Stocks Asses 1969 Head 1250000 <NA> Official da…
10 Live Animals Afghanistan Stocks Asses 1970 Head 1300000 <NA> Official da…
# … with 82,106 more rows, and abbreviated variable name ¹`Flag Description`
The FAOSTAT_livestock data contains information for 253 United Nations Food & Agriculture Organization countries, on livestock animals such as asses (donkeys?), buffaloes, camels, cattle, goats, horses, mules, pigs, and sheep (measured in heads.) The temporal coverage is from 1961-2018. The data were likely reported by the country (based on the “Flag Description” field denoting as “offical data”) or estimated by the FAO (“FAO estimate”.)
[1] 253
Item
Asses Buffaloes Camels Cattle Goats Horses Mules Pigs
8571 3505 3265 13086 12498 11104 6153 12015
Sheep
11919
Year
1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976
1327 1327 1327 1328 1329 1341 1340 1340 1340 1340 1339 1340 1340 1338 1338 1336
1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992
1336 1337 1338 1338 1338 1337 1337 1339 1344 1344 1344 1344 1343 1360 1361 1488
1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
1497 1497 1497 1497 1497 1497 1497 1502 1502 1502 1502 1502 1502 1506 1506 1506
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
1506 1506 1506 1513 1513 1513 1513 1513 1513 1513
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
I first grouped the data by Item (which denotes what type of livestock) and found the average for each, then I sorted them in descending order (using tally and sort). I was able to show for each country and each item, then it occurred to me that it would be interesting to just see the top producing livestock for each country. This then allows us to see the difference between the overall trend for all FAO countries, and how countries differ from that (even by how much.) Future questions might be to compare by year.
# A tibble: 82,116 × 4
Area Item Year Value
<chr> <chr> <dbl> <dbl>
1 Afghanistan Asses 1961 1300000
2 Afghanistan Asses 1962 851850
3 Afghanistan Asses 1963 1001112
4 Afghanistan Asses 1964 1150000
5 Afghanistan Asses 1965 1300000
6 Afghanistan Asses 1966 1200000
7 Afghanistan Asses 1967 1200000
8 Afghanistan Asses 1968 1328000
9 Afghanistan Asses 1969 1250000
10 Afghanistan Asses 1970 1300000
# … with 82,106 more rows
Data Frame Summary
FAOStat_cleaned
Dimensions: 82116 x 4
Duplicates: 0
-----------------------------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
---- ------------- --------------------------------- ----------------------- --------------------- ---------- ---------
1 Area 1. Africa 522 ( 0.6%) 82116 0
[character] 2. Asia 522 ( 0.6%) (100.0%) (0.0%)
3. China, mainland 522 ( 0.6%)
4. Eastern Africa 522 ( 0.6%)
5. Eastern Asia 522 ( 0.6%)
6. Eastern Europe 522 ( 0.6%)
7. Egypt 522 ( 0.6%)
8. Europe 522 ( 0.6%)
9. India 522 ( 0.6%)
10. Northern Africa 522 ( 0.6%)
[ 243 others ] 76896 (93.6%) IIIIIIIIIIIIIIIIII
2 Item 1. Asses 8571 (10.4%) II 82116 0
[character] 2. Buffaloes 3505 ( 4.3%) (100.0%) (0.0%)
3. Camels 3265 ( 4.0%)
4. Cattle 13086 (15.9%) III
5. Goats 12498 (15.2%) III
6. Horses 11104 (13.5%) II
7. Mules 6153 ( 7.5%) I
8. Pigs 12015 (14.6%) II
9. Sheep 11919 (14.5%) II
3 Year Mean (sd) : 1990.4 (16.8) 58 distinct values . . . . : : : : 82116 0
[numeric] min < med < max: : : : . : : : : : : (100.0%) (0.0%)
1961 < 1991 < 2018 : : : : : : : : : :
IQR (CV) : 29 (0) : : : : : : : : : :
: : : : : : : : : :
4 Value Mean (sd) : 11625569 (64779790) 43667 distinct values : 80815 1301
[numeric] min < med < max: : (98.4%) (1.6%)
0 < 224667 < 1489744504 :
IQR (CV) : 2364200 (5.6) :
:
-----------------------------------------------------------------------------------------------------------------------
# group by the type of livestock (= Item)
by_item <- FAOStat_cleaned %>% group_by(Item)
#by_item %>% tally(sort = TRUE)
#by_item %>%
#summarize(avg_value = mean(Value, na.rm=TRUE))
#For each country, group by item, and show the top average value livestock for each country.
item_by_area <- FAOStat_cleaned %>%
group_by(Area, Item) %>%
summarize(mean_value = mean(Value, na.rm =TRUE),
median_value = median(Value, na.rm = TRUE)) %>%
top_n(1)
item_by_area
# A tibble: 256 × 4
# Groups: Area [253]
Area Item mean_value median_value
<chr> <chr> <dbl> <dbl>
1 Afghanistan Sheep 15397675. 14185000
2 Africa Sheep 226054195. 205408754.
3 Albania Sheep 1623645. 1664250
4 Algeria Sheep 15573059. 16974215
5 American Samoa Pigs 10659. 10500
6 Americas Cattle 431297713. 432220557
7 Angola Cattle 3339982. 3200000
8 Antigua and Barbuda Cattle 11211. 13327
9 Argentina Cattle 52116560. 52569504
10 Armenia Sheep 612495. 580178
# … with 246 more rows
Please see above section.
---
title: "Challenge 2"
author: "Tracy Tien"
desription: "Data wrangling: using group() and summarise()"
date: "What is time these days?"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- faostat
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
- FAOSTAT_livestock.csv
I will read in the data and do a series of "poking around" (commented out in the code below) to get a sense of the dataset, and return a more concise version of the data.
```{r}
FAOstat <- read_csv("_data/FAOSTAT_livestock.csv")
# I'm doing a series of "poking around" after reading in the data (commented out) to get a sense of the dataset.
# spec(FAOstat)
# dim(FAOstat) showed that FAOSTAT_livestock has 82116 records and 14 rows.
# colnames(FAOstat)
head(FAOstat)
#install.packages("dplyr")
# I want to keep the descriptive columns instead of their corresponding codes.
codevars <- names(FAOstat) %in% c("Domain Code", "Area Code", "Element Code", "Item Code", "Year Code")
FAOStat_NoCodes <- FAOstat[!codevars]
FAOStat_NoCodes
```
## Describe the data
The FAOSTAT_livestock data contains information for 253 United Nations Food & Agriculture Organization countries, on livestock animals such as asses (donkeys?), buffaloes, camels, cattle, goats, horses, mules, pigs, and sheep (measured in heads.) The temporal coverage is from 1961-2018. The data were likely reported by the country (based on the "Flag Description" field denoting as "offical data") or estimated by the FAO ("FAO estimate".)
```{r}
#| label: summary
n_distinct(select(FAOStat_NoCodes, "Area"))
# table(select(FAOStat_NoCodes,"Element"))
table(select(FAOStat_NoCodes, "Item"))
table(select(FAOStat_NoCodes, "Year"))
```
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
I first grouped the data by Item (which denotes what type of livestock) and found the average for each, then I sorted them in descending order (using tally and sort). I was able to show for each country and each item, then it occurred to me that it would be interesting to just see the top producing livestock for each country. This then allows us to see the difference between the overall trend for all FAO countries, and how countries differ from that (even by how much.) Future questions might be to compare by year.
```{r}
# make a cleaner data to make a summary table, this can also be done with select()
codevars1 <- names(FAOstat) %in% c("Domain", "Domain Code","Area Code", "Element", "Element Code", "Item Code", "Unit", "Year Code", "Flag", "Flag Description")
FAOStat_cleaned <- FAOstat[!codevars1]
FAOStat_cleaned
# summary of cleaner data
library(summarytools)
print(dfSummary(FAOStat_cleaned))
# group by the type of livestock (= Item)
by_item <- FAOStat_cleaned %>% group_by(Item)
#by_item %>% tally(sort = TRUE)
#by_item %>%
#summarize(avg_value = mean(Value, na.rm=TRUE))
#For each country, group by item, and show the top average value livestock for each country.
item_by_area <- FAOStat_cleaned %>%
group_by(Area, Item) %>%
summarize(mean_value = mean(Value, na.rm =TRUE),
median_value = median(Value, na.rm = TRUE)) %>%
top_n(1)
item_by_area
```
### Explain and Interpret
Please see above section.