Challenge 2: Finding the Cattle Dairy Value statistics for each country from 1961 - 2018

challenge_2

faostat

faostat_cattle_diary

Shantanu Patil

Data wrangling and Data Sumarization to find statistics in data

Author

Shantanu Patil

Published

March 1, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

I have selected faostat_cattle_diary dataset to do my analysis on.

Code

# Importing the faostat_cattle_diary dataset and reading them
library(readr)
data <-read_csv("_data/FAOSTAT_cattle_dairy.csv")
#Lets see the first 3 rows of the dataset
head(data, 3)

# A tibble: 3 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QL      Lives…       2 Afgh…    5318 Milk A…     882 Milk…    1961  1961 Head 
2 QL      Lives…       2 Afgh…    5420 Yield       882 Milk…    1961  1961 hg/An
3 QL      Lives…       2 Afgh…    5510 Produc…     882 Milk…    1961  1961 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
#   ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`

Describe the data

Using R commands I am exploring the data and coming up with the description from the data. The dataset contains 14 column, 8 of which have character datatype and 6 have integer data type. Here are the Columns

chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description

dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value

Here is the size of the Dataset — Rows: 36449 Columns: 14

Checking the variation in data for each column

Code

# Check the variation in data for each column
summary(data)

 Domain Code           Domain            Area Code          Area          
 Length:36449       Length:36449       Min.   :   1.0   Length:36449      
 Class :character   Class :character   1st Qu.:  69.0   Class :character  
 Mode  :character   Mode  :character   Median : 141.0   Mode  :character  
                                       Mean   : 775.2                     
                                       3rd Qu.: 215.0                     
                                       Max.   :5504.0                     
                                                                          
  Element Code    Element            Item Code       Item          
 Min.   :5318   Length:36449       Min.   :882   Length:36449      
 1st Qu.:5318   Class :character   1st Qu.:882   Class :character  
 Median :5420   Mode  :character   Median :882   Mode  :character  
 Mean   :5416                      Mean   :882                     
 3rd Qu.:5510                      3rd Qu.:882                     
 Max.   :5510                      Max.   :882                     
                                                                   
   Year Code         Year          Unit               Value          
 Min.   :1961   Min.   :1961   Length:36449       Min.   :        7  
 1st Qu.:1976   1st Qu.:1976   Class :character   1st Qu.:     7849  
 Median :1991   Median :1991   Mode  :character   Median :    43266  
 Mean   :1990   Mean   :1990                      Mean   :  4410235  
 3rd Qu.:2005   3rd Qu.:2005                      3rd Qu.:   700000  
 Max.   :2018   Max.   :2018                      Max.   :683217055  
                                                  NA's   :74         
     Flag           Flag Description  
 Length:36449       Length:36449      
 Class :character   Class :character  
 Mode  :character   Mode  :character

Provide Grouped Summary Statistics

Conducting exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise().

Remove columns with “Item Code” values as it has constant value 882

Code

#install.packages("dplyr")
library(dplyr)
#remove columns with "Item Code" values as it has contant value 882
data<-data%>%
  select(-contains("Item Code"))

Grouping by the Area and finding the mean_value

Code

data %>% 
  group_by(Area) %>% 
  summarise(mean_value = mean(Value))

# A tibble: 232 × 2
   Area                mean_value
   <chr>                    <dbl>
 1 Afghanistan            903662.
 2 Africa               18861019.
 3 Albania                275446.
 4 Algeria                599219.
 5 American Samoa           3611.
 6 Americas             56060267.
 7 Angola                 164004.
 8 Antigua and Barbuda      6402.
 9 Argentina             3279105.
10 Armenia                276119.
# … with 222 more rows

Table output by selecting Columns “Area”, “Year”, and “Value”.

Code

data %>% 
  select(Area, Year, Value)

# A tibble: 36,449 × 3
   Area         Year  Value
   <chr>       <dbl>  <dbl>
 1 Afghanistan  1961 700000
 2 Afghanistan  1961   5000
 3 Afghanistan  1961 350000
 4 Afghanistan  1962 700000
 5 Afghanistan  1962   5000
 6 Afghanistan  1962 350000
 7 Afghanistan  1963 780000
 8 Afghanistan  1963   5128
 9 Afghanistan  1963 400000
10 Afghanistan  1964 780000
# … with 36,439 more rows

I wanted to check out how any data rows / points are available for Antigua and Barbuda in the year of 1975. There are 3 datapoints available.

Code

data %>% 
  filter(Area == "Antigua and Barbuda" , Year == 1975)

# A tibble: 3 × 13
  Domain …¹ Domain Area …² Area  Eleme…³ Element Item  Year …⁴  Year Unit  Value
  <chr>     <chr>    <dbl> <chr>   <dbl> <chr>   <chr>   <dbl> <dbl> <chr> <dbl>
1 QL        Lives…       8 Anti…    5318 Milk A… Milk…    1975  1975 Head   3100
2 QL        Lives…       8 Anti…    5420 Yield   Milk…    1975  1975 hg/An 10645
3 QL        Lives…       8 Anti…    5510 Produc… Milk…    1975  1975 tonn…  3300
# … with 2 more variables: Flag <chr>, `Flag Description` <chr>, and
#   abbreviated variable names ¹`Domain Code`, ²`Area Code`, ³`Element Code`,
#   ⁴`Year Code`

Table Value Statistics

Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for Value within the data set.

Code

#deleting the rows with Null values
data2 <- na.omit(data)
# Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
data2%>%
  group_by(Area)%>%
  summarize(mean.Value = mean(Value, na.rm = TRUE), median.Value = median(Value, na.rm = TRUE), min.Value = min(Value, na.rm = TRUE), max.Value = max(Value, na.rm = TRUE), sd.Value = sd(Value, na.rm = TRUE), var.Value = var(Value, na.rm = TRUE), IQR.Value = IQR(Value, na.rm = TRUE))

# A tibble: 231 × 8
   Area                mean.Va…¹ media…² min.V…³ max.V…⁴ sd.Va…⁵ var.V…⁶ IQR.V…⁷
   <chr>                   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Afghanistan           873703.  5.56e5    3592  4.05e6  1.07e6 1.14e12  1.19e6
 2 Africa              18861019.  1.51e7    4311  7.13e7  1.93e7 3.73e14  3.11e7
 3 Albania                88480.  2.53e4    6350  3.24e5  9.90e4 9.81e 9  1.52e5
 4 Algeria               277666.  1.92e5    6008  1.09e6  2.92e5 8.52e10  5.28e5
 5 American Samoa          3652.  3.15e1      16  1.38e4  5.37e3 2.88e 7  7.98e3
 6 Americas            56060267.  4.51e7   22342  1.84e8  5.50e7 3.03e15  9.12e7
 7 Angola                164317.  1.57e5    4800  4.36e5  1.36e5 1.85e10  2.8 e5
 8 Antigua and Barbuda     6414.  6.2 e3    2300  1.06e4  2.67e3 7.15e 6  5.64e3
 9 Argentina            1242484.  5.80e4   16980  8.1 e6  1.40e6 1.95e12  2.37e6
10 Armenia                47702.  1.95e4   14937  4.44e5  1.08e5 1.18e10  3.45e3
# … with 221 more rows, and abbreviated variable names ¹mean.Value,
#   ²median.Value, ³min.Value, ⁴max.Value, ⁵sd.Value, ⁶var.Value, ⁷IQR.Value

I wanted to check out how any data rows / points are available for Antigua and Barbuda in the year of 1975. There are 3 datapoints available. Some Conclusion from seeing the data summary

The Data is Collected from the year 1961 - 2018
The Item Code column just has one value 882
The Area Code varies from 1 - 5504
The Element Code varies frojm 5318 - 5510
The Value Column has a lot of variance

I deleted the Item Code Column.

I choose Area (Country) as a group because I wanted to see how does the Value of output of cattle and diary farms vary based on teh country they are located in. I think every country has a set of climatic condition, thus the value in tropical countries might be far greater than countries near north pole, I wanted to test this hypothesis.

I found out the relationship is not linear, on seeing the data I realized that some countries might be the around the same distance from equator or have same climatic conditions but still the mean value out put varies. Ex Afghanistan and Africa may have somewhat same climatic conditions but as afganisatn is located i the mountains the mean Value is much lower compared to Africa.