Reading in Data
Author

Daniel Manning

Published

January 7, 2023

Code
library(tidyverse)
library(here)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Tidying of Dataset

First, I loaded the “FAOSTAT_cattle_dairy.csv” dataset. It was already largely tidy; I finished the process by removing variables that were redundant or took only a single value, leaving one variable per column and one observation per row.

Code
# Build the path relative to the project root, then read the CSV
cattle_dairy <- here("posts", "_data", "FAOSTAT_cattle_dairy.csv") %>%
  read_csv()
cattle_dairy
# A tibble: 36,449 × 14
   Domain Cod…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year
   <chr>        <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl>
 1 QL           Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961
 2 QL           Lives…       2 Afgh…    5420 Yield       882 Milk…    1961  1961
 3 QL           Lives…       2 Afgh…    5510 Produc…     882 Milk…    1961  1961
 4 QL           Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1962  1962
 5 QL           Lives…       2 Afgh…    5420 Yield       882 Milk…    1962  1962
 6 QL           Lives…       2 Afgh…    5510 Produc…     882 Milk…    1962  1962
 7 QL           Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1963  1963
 8 QL           Lives…       2 Afgh…    5420 Yield       882 Milk…    1963  1963
 9 QL           Lives…       2 Afgh…    5510 Produc…     882 Milk…    1963  1963
10 QL           Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1964  1964
# … with 36,439 more rows, 4 more variables: Unit <chr>, Value <dbl>,
#   Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
#   ¹​`Domain Code`, ²​`Area Code`, ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`
Code
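# Drop the code columns and other columns judged redundant;
# this keeps Area, Item, Year, Value, and Flag Description
cattle_dairy_new <- cattle_dairy %>%
  select(-c('Domain', 'Domain Code', 'Area Code', 'Element', 'Element Code',
            'Item Code', 'Year Code', 'Unit', 'Flag'))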
cattle_dairy_new
# A tibble: 36,449 × 5
   Area        Item                   Year  Value `Flag Description`
   <chr>       <chr>                 <dbl>  <dbl> <chr>             
 1 Afghanistan Milk, whole fresh cow  1961 700000 FAO estimate      
 2 Afghanistan Milk, whole fresh cow  1961   5000 Calculated data   
 3 Afghanistan Milk, whole fresh cow  1961 350000 FAO estimate      
 4 Afghanistan Milk, whole fresh cow  1962 700000 FAO estimate      
 5 Afghanistan Milk, whole fresh cow  1962   5000 Calculated data   
 6 Afghanistan Milk, whole fresh cow  1962 350000 FAO estimate      
 7 Afghanistan Milk, whole fresh cow  1963 780000 FAO estimate      
 8 Afghanistan Milk, whole fresh cow  1963   5128 Calculated data   
 9 Afghanistan Milk, whole fresh cow  1963 400000 FAO estimate      
10 Afghanistan Milk, whole fresh cow  1964 780000 FAO estimate      
# … with 36,439 more rows
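
One thing worth checking after this step: the raw preview shows three Element Code values (5318, 5420, 5510) for each area and year, so once Element is dropped, Value holds three different measures in a single column. A minimal sketch of one alternative, assuming one row per area and year is the goal, keeps Element Code and spreads it into columns with pivot_wider(). The toy tibble retypes the first nine rows of the preview; the `element_` prefix is a naming choice made here, not something in the dataset.

```r
library(dplyr)
library(tidyr)

# First nine rows of the preview above (Afghanistan, 1961-1963), retyped by hand.
toy <- tibble(
  Area = "Afghanistan",
  Year = rep(1961:1963, each = 3),
  `Element Code` = rep(c(5318, 5420, 5510), times = 3),
  Value = c(700000, 5000, 350000,
            700000, 5000, 350000,
            780000, 5128, 400000)
)

# One column per element code, one row per area and year.
toy_wide <- toy %>%
  pivot_wider(
    names_from = `Element Code`,
    values_from = Value,
    names_prefix = "element_"   # hypothetical prefix, chosen for readability
  )

toy_wide
```

With this shape, each research question can name the measure it is about instead of averaging across all three.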

Narrative, Variables, and Research Question

This dataset consists of five variables with the following types:
Area: string
Item: string
Year: double
Value: double
Flag Description: string

The dataset records values for whole fresh cow milk in different countries across different years, along with a description of whether each value was estimated or calculated.

One potential research question could investigate how values differ by area. For example: Which area produced the most whole fresh cow milk? Which area produced the least? What was the distribution of average whole fresh cow milk production across areas?
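
The area questions could be approached with a grouped summary. The sketch below is only illustrative: the tibble `toy` and its areas and values are invented stand-ins for cattle_dairy_new, and it assumes Value has already been restricted to a single measure, since the preview shows three Value rows per area and year.

```r
library(dplyr)

# Invented stand-in for cattle_dairy_new; areas and values are made up.
toy <- tibble(
  Area  = rep(c("Area A", "Area B"), each = 3),
  Year  = rep(1961:1963, times = 2),
  Value = c(100, 120, 140, 900, 950, 1000)
)

# Average value per area, largest first.
area_summary <- toy %>%
  group_by(Area) %>%
  summarise(mean_value = mean(Value, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(mean_value))

area_summary
```

The first and last rows of `area_summary` answer the "most" and "least" questions, and the `mean_value` column could feed a histogram for the distribution question.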

Another research question could investigate the change in milk production with respect to year. For example: Did the production of whole fresh cow milk increase from 1961 to 2018? During which year was the most cow milk produced and during which year was the least produced?
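
A rough sketch of the year questions, again on an invented tibble `toy` rather than the real data, sums Value within each year and then picks out the largest and smallest totals. It assumes Value holds a single measure and that every row of Area is a distinct country; if the Area column also contains regional aggregates, which I have not verified for this file, the totals would double count.

```r
library(dplyr)

# Invented stand-in for cattle_dairy_new; areas and values are made up.
toy <- tibble(
  Area  = rep(c("Area A", "Area B"), each = 3),
  Year  = rep(1961:1963, times = 2),
  Value = c(100, 120, 140, 900, 950, 1000)
)

# Total value per year across all areas.
yearly <- toy %>%
  group_by(Year) %>%
  summarise(total_value = sum(Value, na.rm = TRUE), .groups = "drop")

# Years with the highest and lowest totals.
peak   <- slice_max(yearly, total_value, n = 1)
trough <- slice_min(yearly, total_value, n = 1)

yearly
```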

A third research question could investigate the relationship between the method of estimating or calculating values and the values themselves. For example: Is there a difference between the average of values described as “Calculated data” and the average of those described as “FAO estimate”?
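
A comparison by flag could start from a grouped summary like the sketch below, which uses invented values in a hypothetical tibble `toy`. One caution from the previews above: the rows flagged “Calculated data” there are the ones with Element Code 5420, so in the real data the comparison is probably only meaningful within a single measure.

```r
library(dplyr)

# Invented values; only the two flag labels come from the dataset preview.
toy <- tibble(
  `Flag Description` = c("FAO estimate", "FAO estimate", "FAO estimate",
                         "Calculated data", "Calculated data"),
  Value = c(10, 20, 30, 40, 60)
)

# Count, mean, and median of Value for each flag description.
flag_summary <- toy %>%
  group_by(`Flag Description`) %>%
  summarise(
    n = n(),
    mean_value = mean(Value, na.rm = TRUE),
    median_value = median(Value, na.rm = TRUE),
    .groups = "drop"
  )

flag_summary
```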

Lastly, this dataset could be used to investigate relationships between multiple variables, such as year and area. For example: Did some areas see an increase in production from 1961 to 2018 while others saw a decrease? Was the magnitude of change in production different across areas?
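
One way to sketch the year-by-area question is to compare each area's first and last recorded value. The tibble `toy`, its areas, and its values are invented for illustration, and the sketch assumes Value holds a single measure.

```r
library(dplyr)

# Invented stand-in for cattle_dairy_new; areas and values are made up.
toy <- tibble(
  Area  = rep(c("Area A", "Area B"), each = 3),
  Year  = rep(1961:1963, times = 2),
  Value = c(100, 120, 140, 1000, 950, 900)
)

# Change from the earliest to the latest year within each area.
change_by_area <- toy %>%
  group_by(Area) %>%
  arrange(Year, .by_group = TRUE) %>%
  summarise(
    first_value = first(Value),
    last_value  = last(Value),
    change      = last_value - first_value,
    .groups = "drop"
  ) %>%
  mutate(direction = if_else(change > 0, "increase", "decrease or flat"))

change_by_area
```

The `change` column gives the magnitude for each area, and `direction` separates the areas that grew from those that did not.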