challenge_2
FAOSTAT_cattle_dairy.csv
Author

Mani Shanker Kamarapu

Published

August 16, 2022

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in the Data

Code
FAOSTAT_cattle_dairy <- read_csv("_data/FAOSTAT_cattle_dairy.csv")
View(FAOSTAT_cattle_dairy)

As we read the data, there are many interesting variables and a lot of observations based on yearly analysis of the various cattle products.

Describe the data

Code
dim(FAOSTAT_cattle_dairy)
[1] 36449    14

We can see that there are 36449 rows and 14 columns in the data. We can see from observations from different variables that “Value” variable is pretty much the only real valuable data we are getting from this data set and remaining variables are either unchanged(Domain and Item) or grouping variables(Area, Element, Year and Flag) which can be used to summarize data and find the categories we are interested in and also can be used to find specific information we need and let’s try to get to some more useful info using some other functions.

Code
colnames(FAOSTAT_cattle_dairy)
 [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
 [5] "Element Code"     "Element"          "Item Code"        "Item"            
 [9] "Year Code"        "Year"             "Unit"             "Value"           
[13] "Flag"             "Flag Description"

These are the variables available in the data set.

Code
summary(filter(select(FAOSTAT_cattle_dairy, Value)))
     Value          
 Min.   :        7  
 1st Qu.:     7849  
 Median :    43266  
 Mean   :  4410235  
 3rd Qu.:   700000  
 Max.   :683217055  
 NA's   :74         

It is the summary of the Value variable of the data set which is the only real data variable in data set.

Code
Area_num <- distinct(FAOSTAT_cattle_dairy, Area)
Element_num <- distinct(FAOSTAT_cattle_dairy, Element)
Year_num <- distinct(FAOSTAT_cattle_dairy, Year)
Flag_num <- distinct(FAOSTAT_cattle_dairy, Flag)
Distinct_count <- c(nrow(Area_num), nrow(Element_num), nrow(Year_num), nrow(Flag_num))
Distinct_count
[1] 232   3  58   7

We can observe that data set has the data from 1961-2018 on 3 elements at 232 different areas.

I am interested in finding in more information about Flag and how it corresponds to the description they have given and can we group the values by using it.

Code
FLag_info <- FAOSTAT_cattle_dairy %>%
  select(Flag,`Flag Description`)
unique(FLag_info)
# A tibble: 7 × 2
  Flag  `Flag Description`                                                      
  <chr> <chr>                                                                   
1 F     FAO estimate                                                            
2 Fc    Calculated data                                                         
3 <NA>  Official data                                                           
4 *     Unofficial figure                                                       
5 Im    FAO data based on imputation methodology                                
6 M     Data not available                                                      
7 A     Aggregate, may include official, semi-official, estimated or calculated…

From the above observation we got to know there are 7 types of flags and it is based on data provided and type of data and values they are and how is has been taken.

Provide Grouped Summary Statistics

Area and Element wise analysis

Code
AE_analysis <- FAOSTAT_cattle_dairy %>%
  group_by(Area, Element) %>%
  summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE))
AE_analysis
# A tibble: 695 × 8
# Groups:   Area [232]
   Area        Element      mean_value median_…¹ sd_va…² min_v…³ max_v…⁴ IQR_v…⁵
   <chr>       <chr>             <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Afghanistan Milk Animals   1790863.  1150000   1.08e6  7   e5  4.05e6  1.57e6
 2 Afghanistan Production      914876.   600000   5.01e5  3   e5  1.87e6  8.86e5
 3 Afghanistan Yield             5246.     5128   9.06e2  3.59e3  7.33e3  6.35e2
 4 Africa      Milk Animals  37629052. 33173522.  1.75e7  1.70e7  7.13e7  3.00e7
 5 Africa      Production    18949137. 15058869   1.02e7  7.65e6  3.68e7  1.88e7
 6 Africa      Yield             4869.     4600   4.39e2  4.31e3  5.84e3  6.93e2
 7 Albania     Milk Animals    288774.   284650   1.13e5  1.17e5  4.88e5  1.92e5
 8 Albania     Production      521367.   392500   3.37e5  7.83e4  9.83e5  6.78e5
 9 Albania     Yield            16197     13857   6.61e3  6.35e3  2.84e4  8.11e3
10 Algeria     Milk Animals    659719.   640000   2.50e5  2.63e5  1.11e6  4.14e5
# … with 685 more rows, and abbreviated variable names ¹​median_value,
#   ²​sd_value, ³​min_value, ⁴​max_value, ⁵​IQR_value
# ℹ Use `print(n = ...)` to see more rows

As per the analysis, in each area the we have three kinds of elements and there are a lot of variations and mean value is not dependent to element itself in any area but it differs in each area and we can get a deeper understanding when we analyse the area and element separately.

Year and Element wise analysis

Code
YE_analysis <- FAOSTAT_cattle_dairy %>%
  group_by(Year, Element) %>%
  summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE))
YE_analysis
# A tibble: 174 × 8
# Groups:   Year [58]
    Year Element      mean_value median_value  sd_value min_va…¹ max_v…² IQR_v…³
   <dbl> <chr>             <dbl>        <dbl>     <dbl>    <dbl>   <dbl>   <dbl>
 1  1961 Milk Animals   3582516.       200550 15152245.        8  1.77e8  1.23e6
 2  1961 Production     6335891.       100368 28456161.        7  3.14e8  1.74e6
 3  1961 Yield            11949.         8812     9905.     1200  4.29e4  1.41e4
 4  1962 Milk Animals   3567116.       200850 15117111.       17  1.77e8  1.21e6
 5  1962 Production     6384721.       100000 28667301.       23  3.16e8  1.76e6
 6  1962 Yield            12105.         8899    10173.     1200  4.43e4  1.41e4
 7  1963 Milk Animals   3578137.       219550 15189434.       20  1.77e8  1.21e6
 8  1963 Production     6319684.       102500 28305726.       20  3.13e8  1.78e6
 9  1963 Yield            12147.         9051    10241.     1191  4.59e4  1.44e4
10  1964 Milk Animals   3554094.       215650 15090377.       25  1.76e8  1.25e6
# … with 164 more rows, and abbreviated variable names ¹​min_value, ²​max_value,
#   ³​IQR_value
# ℹ Use `print(n = ...)` to see more rows

As per the data, we can say that the elements has been increased by passing year, there are some fluctuations in between but there is an overall increase by the year.

Element-wise analysis

Code
E_analysis <- FAOSTAT_cattle_dairy %>%
  group_by(Element) %>%
  summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE)) %>%
  arrange(desc(mean_value))
E_analysis
# A tibble: 3 × 7
  Element      mean_value median_value  sd_value min_value max_value IQR_value
  <chr>             <dbl>        <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 Production     9001419.       295500 40268994.         7 683217055   2736860
2 Milk Animals   4205410.       295000 18041595.         8 276573845   1525233
3 Yield            19329.        13218    19361.       923    134121     21367

As per the above observation, the mean value of the production is maximum and yield is the minimum. It shows that the production value is more with the less yield and it leads to an overall profit.

Area-wise analysis

Code
A_analysis <- FAOSTAT_cattle_dairy %>%
  group_by(Area) %>%
  summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE)) %>%
  arrange(desc(mean_value))
A_analysis
# A tibble: 232 × 7
   Area                     mean_value median_…¹ sd_va…² min_v…³ max_v…⁴ IQR_v…⁵
   <chr>                         <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 World                    230162236.    2.20e8  2.02e8   17662  6.83e8  3.87e8
 2 Europe                    99143192.    8.26e7  9.82e7   21977  2.80e8  2.09e8
 3 Americas                  56060267.    4.51e7  5.50e7   22342  1.84e8  9.12e7
 4 Asia                      48224349.    3.71e7  5.08e7    6034  2.13e8  7.49e7
 5 Eastern Europe            45406646.    4.82e7  4.33e7   16644  1.40e8  7.46e7
 6 USSR                      42870195.    4.18e7  3.68e7   15900  1.08e8  8.09e7
 7 Western Europe            30380343.    1.67e7  3.27e7   30476  8.74e7  7.21e7
 8 Northern America          29827044.    1.14e7  3.49e7   32302  1.06e8  6.43e7
 9 Southern Asia             28165903.    2.50e7  2.75e7    3824  1.17e8  4.64e7
10 United States of America  26633354.    1.00e7  3.15e7   33068  9.87e7  5.61e7
# … with 222 more rows, and abbreviated variable names ¹​median_value,
#   ²​sd_value, ³​min_value, ⁴​max_value, ⁵​IQR_value
# ℹ Use `print(n = ...)` to see more rows

As per the analysis, Europe has the highest cattle dairy and British Virgin Islands has the minimum, we can say none.

Year-wise analysis

Code
Y_analysis <- FAOSTAT_cattle_dairy %>%
  group_by(Year) %>%
  summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE)) %>%
  arrange(desc(mean_value))
Y_analysis
# A tibble: 58 × 7
    Year mean_value median_value  sd_value min_value max_value IQR_value
   <dbl>      <dbl>        <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1  2017   5682309.       80522. 33349614.        20 677670685   958680 
 2  2018   5654115.       79029  33488177.        21 683217055   938124 
 3  2016   5617407.       77378. 32809823.        20 665596536   981852.
 4  2015   5586308.       78062. 32577146.        20 661430554   964494 
 5  2014   5534018.       77490  32246947.        20 655245580   947544.
 6  2013   5411995.       75958. 31386003.        20 635379383   964509 
 7  2012   5372784.       74057  31140061.        20 630244839   946680 
 8  2011   5287165.       71750  30561395.        20 616177381   953420.
 9  2010   5185483.       70586  29911569.        20 601868328   928111 
10  2009   5088955.       70375  29357924.         9 590471016   895700 
# … with 48 more rows
# ℹ Use `print(n = ...)` to see more rows

As per the data, the cattle dairy production has been increased by the passing year, there are some fluctuations in middle but the overall it is increasing, we can see from the data that the increase is nearly 85 percentage from 1961 to 2018.

Explain and Interpret

I chose Element, Area and Year subgroups from data set. The reason for choosing them is there importance in analysis and also tried to analysis using two variables to know the dependency of values on different variables. So I have done Area and Element wise analysis, Year and Element wise analysis, Element-wise analysis, Area-wise analysis and Year-wise analysis. Conclusion to my analysis would be Europe has the highest of the cattle dairy and the production of the dairy is increasing gradually as by the year.