# A tibble: 82,116 × 14
Domain Cod…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1961 1961
2 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1962 1962
3 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1963 1963
4 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1964 1964
5 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1965 1965
6 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1966 1966
7 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1967 1967
8 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1968 1968
9 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1969 1969
10 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1970 1970
# … with 82,106 more rows, 4 more variables: Unit <chr>, Value <dbl>,
# Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
# ¹`Domain Code`, ²`Area Code`, ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
This dataset comes from the Food and Agriculture Organization of the United Nations (FAO). The FAO publishes country-level data regularly, and I am going to look at country-level estimates of the number of animals raised as livestock. There are 82,116 rows in the livestock data.
Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
Code
spec(FAO)
cols(
`Domain Code` = col_character(),
Domain = col_character(),
`Area Code` = col_double(),
Area = col_character(),
`Element Code` = col_double(),
Element = col_character(),
`Item Code` = col_double(),
Item = col_character(),
`Year Code` = col_double(),
Year = col_double(),
Unit = col_character(),
Value = col_double(),
Flag = col_character(),
`Flag Description` = col_character()
)
Code
FAO.sm <- FAO %>% select(-contains("Code"))
FAO.sm
# A tibble: 82,116 × 9
Domain Area Element Item Year Unit Value Flag Flag Descr…¹
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
1 Live Animals Afghanistan Stocks Asses 1961 Head 1300000 <NA> Official da…
2 Live Animals Afghanistan Stocks Asses 1962 Head 851850 <NA> Official da…
3 Live Animals Afghanistan Stocks Asses 1963 Head 1001112 <NA> Official da…
4 Live Animals Afghanistan Stocks Asses 1964 Head 1150000 F FAO estimate
5 Live Animals Afghanistan Stocks Asses 1965 Head 1300000 <NA> Official da…
6 Live Animals Afghanistan Stocks Asses 1966 Head 1200000 <NA> Official da…
7 Live Animals Afghanistan Stocks Asses 1967 Head 1200000 <NA> Official da…
8 Live Animals Afghanistan Stocks Asses 1968 Head 1328000 <NA> Official da…
9 Live Animals Afghanistan Stocks Asses 1969 Head 1250000 <NA> Official da…
10 Live Animals Afghanistan Stocks Asses 1970 Head 1300000 <NA> Official da…
# … with 82,106 more rows, and abbreviated variable name ¹`Flag Description`
# ℹ Use `print(n = ...)` to see more rows
Generated by summarytools 1.0.1 (R version 4.2.1) 2022-09-01
Based on the results of the spec() function, I can see that there are six variables of type double and eight of type character. Of the six double variables, Area Code, Year and Item Code are good grouping variables because each takes a limited set of values that repeat across many rows, rather than a measurement that differs from row to row. I dropped every column whose name contains Code (including Domain Code, which is character) because these are just identifier codes for database management purposes. Using dfSummary() from summarytools, I can say that the records in this dataset are Live Animals stocks and that the unit of the values is Head.
Each case is an estimate, stored in Value, of the number of live animals of one type in one area in one year. In total, I have estimates of the stock of nine different types of livestock (Asses, Buffaloes, Camels, Cattle, Goats, Horses, Mules, Pigs, Sheep) in 253 areas for 58 years. The flags indicate what type of estimate is being used.
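The counts of livestock types, areas and years quoted above can be checked by counting distinct values per column. The following is a minimal sketch on a small made-up data frame that only borrows the column names of FAO.sm; the object names `toy` and `distinct_counts` and all values are hypothetical, and on the real data `toy` would be replaced by FAO.sm.

```r
library(dplyr)

# Hypothetical toy data with the same column names as FAO.sm (values are made up)
toy <- data.frame(
  Area  = c("Afghanistan", "Afghanistan", "Albania", "Albania"),
  Item  = c("Asses", "Cattle", "Asses", "Asses"),
  Year  = c(1961, 1961, 1961, 1962),
  Value = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Count how many distinct values each identifying column takes
distinct_counts <- toy %>%
  summarise(across(c(Area, Item, Year), n_distinct))

distinct_counts
```

Each column of the one-row result holds the number of distinct values for that variable, which is the kind of figure reported in the paragraph above.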
# A tibble: 28 × 2
Area n
<chr> <int>
1 Africa 522
2 Americas 464
3 Asia 522
4 Australia and New Zealand 376
5 Caribbean 464
6 Central America 406
7 Central Asia 243
8 Eastern Africa 522
9 Eastern Asia 522
10 Eastern Europe 522
# … with 18 more rows
# ℹ Use `print(n = ...)` to see more rows
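The table above counts records per Area among rows flagged A. As a rough sketch of that step, the example below uses `count()`, which is shorthand for `group_by()` followed by `summarize(n = n())`. The data frame `toy` and its values are made up for illustration; only the column names Area and Flag come from the dataset.

```r
library(dplyr)

# Hypothetical toy data: three aggregate rows (Flag "A") and one country row
toy <- data.frame(
  Area = c("Africa", "Africa", "Asia", "Afghanistan"),
  Flag = c("A", "A", "A", NA),
  stringsAsFactors = FALSE
)

# Keep only rows flagged as aggregates, then count rows per Area
agg_counts <- toy %>%
  filter(Flag == "A") %>%
  count(Area)

agg_counts
```

The country row is excluded because its Flag is NA, and `filter()` keeps only rows where the condition is TRUE.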
Code
FAO_clc <- FAO.sm %>% filter(Flag != "A")
FAO_clc
# A tibble: 31,279 × 9
Domain Area Element Item Year Unit Value Flag Flag Descr…¹
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
1 Live Animals Afghanistan Stocks Asses 1964 Head 1150000 F FAO estimate
2 Live Animals Afghanistan Stocks Asses 1973 Head 1250000 F FAO estimate
3 Live Animals Afghanistan Stocks Asses 1974 Head 1250000 F FAO estimate
4 Live Animals Afghanistan Stocks Asses 1975 Head 1250000 F FAO estimate
5 Live Animals Afghanistan Stocks Asses 1976 Head 1250000 F FAO estimate
6 Live Animals Afghanistan Stocks Asses 1978 Head 1300000 * Unofficial …
7 Live Animals Afghanistan Stocks Asses 1979 Head 1300000 * Unofficial …
8 Live Animals Afghanistan Stocks Asses 1980 Head 1295000 * Unofficial …
9 Live Animals Afghanistan Stocks Asses 1981 Head 1315000 * Unofficial …
10 Live Animals Afghanistan Stocks Asses 1982 Head 1315000 * Unofficial …
# … with 31,269 more rows, and abbreviated variable name ¹`Flag Description`
# ℹ Use `print(n = ...)` to see more rows
Here we can confirm that not all cases are countries. Flag value A marks Areas that are actually regional aggregations. These should be filtered out to keep every case at the country level, which is what the second filter statement does. Because comparisons with NA return NA, Flag != "A" also drops the rows whose Flag is NA (the official-data rows in the preview above), which is why FAO_clc shrinks to 31,279 rows and its first row is the 1964 FAO estimate. The number of records per regional aggregation is not uniform: many regions have 522 records, but in the rows shown Central Asia has 243 and Australia and New Zealand 376, and Melanesia and Micronesia also stand out. FAO_clc is a more specific version of the dataset that is meant to include only country-level cases. I conducted exploratory analysis on FAO_clc grouped by Item, and my first impression was how vastly different the mean and median are for each Item, which suggests that the distribution of Value is strongly skewed. The standard deviation for each Item is also very high, which indicates that the values are widely spread out. The min and max values tell little about the dataset.
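The FAO.sm preview shows Flag as NA for official data, and `filter()` discards rows where the condition evaluates to NA. The sketch below, on made-up data, contrasts the filter used above with one possible NA-safe variant. The names `toy`, `dropped_na` and `kept_na` are hypothetical, and whether the official-data rows should be kept is a judgment call for the analysis.

```r
library(dplyr)

# Hypothetical toy data: an official-data row (Flag NA), an FAO estimate, an aggregate
toy <- data.frame(
  Area = c("Afghanistan", "Afghanistan", "Africa"),
  Flag = c(NA, "F", "A"),
  stringsAsFactors = FALSE
)

# NA != "A" evaluates to NA, so the row with a missing Flag is dropped as well
dropped_na <- toy %>% filter(Flag != "A")

# Explicitly keep rows with a missing Flag while still removing the aggregates
kept_na <- toy %>% filter(is.na(Flag) | Flag != "A")

nrow(dropped_na)
nrow(kept_na)
```

On the real data, the second form would be applied to FAO.sm in place of `toy`.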
Source Code
---
title: "Challenge 2 Instructions"
author: "Meredith Rolfe"
description: "Data wrangling: using group_by() and summarise()"
date: "08/16/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - FAO
---

```{r}
#| label: setup
#| warning: false
#| message: false

library(tidyverse)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

## Read in Data

```{r}
FAO <- read_csv("_data/FAOSTAT_livestock.csv")
FAO
```

This dataset comes from the Food and Agriculture Organization of the United Nations (FAO). The FAO publishes country-level data regularly, and I am going to look at country-level estimates of the number of animals raised as livestock. There are `r nrow(FAO)` rows in the livestock data.

## Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

```{r}
#| label: summary
spec(FAO)

# Drop every column whose name contains "Code"
FAO.sm <- FAO %>% select(-contains("Code"))
FAO.sm

print(
  dfSummary(
    FAO.sm,
    varnumbers = FALSE,
    plain.ascii = FALSE,
    style = "grid",
    graph.magnif = 0.70,
    valid.col = FALSE
  ),
  method = 'render',
  table.classes = 'table-condensed'
)
```

Based on the results of the spec() function, I can see that there are six variables of type double and eight of type character. Of the six double variables, *Area Code*, *Year* and *Item Code* are good grouping variables because each takes a limited set of values that repeat across many rows, rather than a measurement that differs from row to row. I dropped every column whose name contains **Code** (including *Domain Code*, which is character) because these are just identifier codes for database management purposes. Using dfSummary() from summarytools, I can say that the records in this dataset are Live Animals stocks and that the unit of the values is Head.

Each case is an estimate, stored in Value, of the number of live animals of one type in one area in one year. In total, I have estimates of the stock of nine different types of livestock (Asses, Buffaloes, Camels, Cattle, Goats, Horses, Mules, Pigs, Sheep) in 253 areas for 58 years. The flags indicate what type of estimate is being used.

## Provide Grouped Summary Statistics

```{r}
# Count records per Area among the regional aggregates (Flag "A")
FAO.sm %>%
  filter(Flag == "A") %>%
  group_by(Area) %>%
  summarize(n = n())

# Remove the regional aggregates
FAO_clc <- FAO.sm %>% filter(Flag != "A")
FAO_clc

# Summary statistics of Value for each livestock type
# Note: `mode` holds the group row count from n(), not the statistical mode
FAO_clc %>%
  group_by(Item) %>%
  summarize(
    avg = mean(Value, na.rm = TRUE),
    mode = n(),
    median = median(Value, na.rm = TRUE),
    stdev = sd(Value, na.rm = TRUE),
    min = min(Value, na.rm = TRUE),
    max = max(Value, na.rm = TRUE)
  )
```

### Explain and Interpret

Here we can confirm that not all cases are countries. Flag value *A* marks *Areas* that are actually regional aggregations. These should be filtered out to keep every case at the country level, which is what the second filter statement does. Because comparisons with NA return NA, `Flag != "A"` also drops the rows whose Flag is NA (the official-data rows in the preview above), which is why FAO_clc shrinks to 31,279 rows and its first row is the 1964 FAO estimate. The number of records per regional aggregation is not uniform: many regions have 522 records, but in the rows shown Central Asia has 243 and Australia and New Zealand 376, and Melanesia and Micronesia also stand out. FAO_clc is a more specific version of the dataset that is meant to include only country-level cases. I conducted exploratory analysis on FAO_clc grouped by *Item*, and my first impression was how vastly different the mean and median are for each *Item*, which suggests that the distribution of Value is strongly skewed. The standard deviation for each *Item* is also very high, which indicates that the values are widely spread out. The min and max values tell little about the dataset.