Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Mani Shanker Kamarapu
August 16, 2022
As we read the data, there are many interesting variables and a lot of observations based on yearly analysis of the various cattle products.
We can see that there are 36449 rows and 14 columns in the data. We can see from observations from different variables that “Value” variable is pretty much the only real valuable data we are getting from this data set and remaining variables are either unchanged(Domain and Item) or grouping variables(Area, Element, Year and Flag) which can be used to summarize data and find the categories we are interested in and also can be used to find specific information we need and let’s try to get to some more useful info using some other functions.
[1] "Domain Code" "Domain" "Area Code" "Area"
[5] "Element Code" "Element" "Item Code" "Item"
[9] "Year Code" "Year" "Unit" "Value"
[13] "Flag" "Flag Description"
These are the variables available in the data set.
Value
Min. : 7
1st Qu.: 7849
Median : 43266
Mean : 4410235
3rd Qu.: 700000
Max. :683217055
NA's :74
It is the summary of the Value variable of the data set which is the only real data variable in data set.
Area_num <- distinct(FAOSTAT_cattle_dairy, Area)
Element_num <- distinct(FAOSTAT_cattle_dairy, Element)
Year_num <- distinct(FAOSTAT_cattle_dairy, Year)
Flag_num <- distinct(FAOSTAT_cattle_dairy, Flag)
Distinct_count <- c(nrow(Area_num), nrow(Element_num), nrow(Year_num), nrow(Flag_num))
Distinct_count
[1] 232 3 58 7
We can observe that data set has the data from 1961-2018 on 3 elements at 232 different areas.
I am interested in finding in more information about Flag and how it corresponds to the description they have given and can we group the values by using it.
# A tibble: 7 × 2
Flag `Flag Description`
<chr> <chr>
1 F FAO estimate
2 Fc Calculated data
3 <NA> Official data
4 * Unofficial figure
5 Im FAO data based on imputation methodology
6 M Data not available
7 A Aggregate, may include official, semi-official, estimated or calculated…
From the above observation we got to know there are 7 types of flags and it is based on data provided and type of data and values they are and how is has been taken.
AE_analysis <- FAOSTAT_cattle_dairy %>%
group_by(Area, Element) %>%
summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE))
AE_analysis
# A tibble: 695 × 8
# Groups: Area [232]
Area Element mean_value median_…¹ sd_va…² min_v…³ max_v…⁴ IQR_v…⁵
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Milk Animals 1790863. 1150000 1.08e6 7 e5 4.05e6 1.57e6
2 Afghanistan Production 914876. 600000 5.01e5 3 e5 1.87e6 8.86e5
3 Afghanistan Yield 5246. 5128 9.06e2 3.59e3 7.33e3 6.35e2
4 Africa Milk Animals 37629052. 33173522. 1.75e7 1.70e7 7.13e7 3.00e7
5 Africa Production 18949137. 15058869 1.02e7 7.65e6 3.68e7 1.88e7
6 Africa Yield 4869. 4600 4.39e2 4.31e3 5.84e3 6.93e2
7 Albania Milk Animals 288774. 284650 1.13e5 1.17e5 4.88e5 1.92e5
8 Albania Production 521367. 392500 3.37e5 7.83e4 9.83e5 6.78e5
9 Albania Yield 16197 13857 6.61e3 6.35e3 2.84e4 8.11e3
10 Algeria Milk Animals 659719. 640000 2.50e5 2.63e5 1.11e6 4.14e5
# … with 685 more rows, and abbreviated variable names ¹median_value,
# ²sd_value, ³min_value, ⁴max_value, ⁵IQR_value
# ℹ Use `print(n = ...)` to see more rows
As per the analysis, in each area the we have three kinds of elements and there are a lot of variations and mean value is not dependent to element itself in any area but it differs in each area and we can get a deeper understanding when we analyse the area and element separately.
YE_analysis <- FAOSTAT_cattle_dairy %>%
group_by(Year, Element) %>%
summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE))
YE_analysis
# A tibble: 174 × 8
# Groups: Year [58]
Year Element mean_value median_value sd_value min_va…¹ max_v…² IQR_v…³
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1961 Milk Animals 3582516. 200550 15152245. 8 1.77e8 1.23e6
2 1961 Production 6335891. 100368 28456161. 7 3.14e8 1.74e6
3 1961 Yield 11949. 8812 9905. 1200 4.29e4 1.41e4
4 1962 Milk Animals 3567116. 200850 15117111. 17 1.77e8 1.21e6
5 1962 Production 6384721. 100000 28667301. 23 3.16e8 1.76e6
6 1962 Yield 12105. 8899 10173. 1200 4.43e4 1.41e4
7 1963 Milk Animals 3578137. 219550 15189434. 20 1.77e8 1.21e6
8 1963 Production 6319684. 102500 28305726. 20 3.13e8 1.78e6
9 1963 Yield 12147. 9051 10241. 1191 4.59e4 1.44e4
10 1964 Milk Animals 3554094. 215650 15090377. 25 1.76e8 1.25e6
# … with 164 more rows, and abbreviated variable names ¹min_value, ²max_value,
# ³IQR_value
# ℹ Use `print(n = ...)` to see more rows
As per the data, we can say that the elements has been increased by passing year, there are some fluctuations in between but there is an overall increase by the year.
E_analysis <- FAOSTAT_cattle_dairy %>%
group_by(Element) %>%
summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE)) %>%
arrange(desc(mean_value))
E_analysis
# A tibble: 3 × 7
Element mean_value median_value sd_value min_value max_value IQR_value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Production 9001419. 295500 40268994. 7 683217055 2736860
2 Milk Animals 4205410. 295000 18041595. 8 276573845 1525233
3 Yield 19329. 13218 19361. 923 134121 21367
As per the above observation, the mean value of the production is maximum and yield is the minimum. It shows that the production value is more with the less yield and it leads to an overall profit.
A_analysis <- FAOSTAT_cattle_dairy %>%
group_by(Area) %>%
summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE)) %>%
arrange(desc(mean_value))
A_analysis
# A tibble: 232 × 7
Area mean_value median_…¹ sd_va…² min_v…³ max_v…⁴ IQR_v…⁵
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 World 230162236. 2.20e8 2.02e8 17662 6.83e8 3.87e8
2 Europe 99143192. 8.26e7 9.82e7 21977 2.80e8 2.09e8
3 Americas 56060267. 4.51e7 5.50e7 22342 1.84e8 9.12e7
4 Asia 48224349. 3.71e7 5.08e7 6034 2.13e8 7.49e7
5 Eastern Europe 45406646. 4.82e7 4.33e7 16644 1.40e8 7.46e7
6 USSR 42870195. 4.18e7 3.68e7 15900 1.08e8 8.09e7
7 Western Europe 30380343. 1.67e7 3.27e7 30476 8.74e7 7.21e7
8 Northern America 29827044. 1.14e7 3.49e7 32302 1.06e8 6.43e7
9 Southern Asia 28165903. 2.50e7 2.75e7 3824 1.17e8 4.64e7
10 United States of America 26633354. 1.00e7 3.15e7 33068 9.87e7 5.61e7
# … with 222 more rows, and abbreviated variable names ¹median_value,
# ²sd_value, ³min_value, ⁴max_value, ⁵IQR_value
# ℹ Use `print(n = ...)` to see more rows
As per the analysis, Europe has the highest cattle dairy and British Virgin Islands has the minimum, we can say none.
Y_analysis <- FAOSTAT_cattle_dairy %>%
group_by(Year) %>%
summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE)) %>%
arrange(desc(mean_value))
Y_analysis
# A tibble: 58 × 7
Year mean_value median_value sd_value min_value max_value IQR_value
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2017 5682309. 80522. 33349614. 20 677670685 958680
2 2018 5654115. 79029 33488177. 21 683217055 938124
3 2016 5617407. 77378. 32809823. 20 665596536 981852.
4 2015 5586308. 78062. 32577146. 20 661430554 964494
5 2014 5534018. 77490 32246947. 20 655245580 947544.
6 2013 5411995. 75958. 31386003. 20 635379383 964509
7 2012 5372784. 74057 31140061. 20 630244839 946680
8 2011 5287165. 71750 30561395. 20 616177381 953420.
9 2010 5185483. 70586 29911569. 20 601868328 928111
10 2009 5088955. 70375 29357924. 9 590471016 895700
# … with 48 more rows
# ℹ Use `print(n = ...)` to see more rows
As per the data, the cattle dairy production has been increased by the passing year, there are some fluctuations in middle but the overall it is increasing, we can see from the data that the increase is nearly 85 percentage from 1961 to 2018.
I chose Element, Area and Year subgroups from data set. The reason for choosing them is there importance in analysis and also tried to analysis using two variables to know the dependency of values on different variables. So I have done Area and Element wise analysis, Year and Element wise analysis, Element-wise analysis, Area-wise analysis and Year-wise analysis. Conclusion to my analysis would be Europe has the highest of the cattle dairy and the production of the dairy is increasing gradually as by the year.
---
title: "Challenge 2"
author: "Mani Shanker Kamarapu"
desription: "Data wrangling: using group() and summarise()"
date: "08/16/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- FAOSTAT_cattle_dairy.csv
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Data
```{r}
#| label: Reading data
FAOSTAT_cattle_dairy <- read_csv("_data/FAOSTAT_cattle_dairy.csv")
View(FAOSTAT_cattle_dairy)
```
As we read the data, there are many interesting variables and a lot of observations based on yearly analysis of the various cattle products.
## Describe the data
```{r}
#| label : Dimensions of data set
dim(FAOSTAT_cattle_dairy)
```
We can see that there are 36449 rows and 14 columns in the data. We can see from observations from different variables that "Value" variable is pretty much the only real valuable data we are getting from this data set and remaining variables are either unchanged(Domain and Item) or grouping variables(Area, Element, Year and Flag) which can be used to summarize data and find the categories we are interested in and also can be used to find specific information we need and let's try to get to some more useful info using some other functions.
```{r}
#| label: Diff variables of data set
colnames(FAOSTAT_cattle_dairy)
```
These are the variables available in the data set.
```{r}
#| label: Summary
summary(filter(select(FAOSTAT_cattle_dairy, Value)))
```
It is the summary of the Value variable of the data set which is the only real data variable in data set.
```{r}
#| label : Grouping variables
Area_num <- distinct(FAOSTAT_cattle_dairy, Area)
Element_num <- distinct(FAOSTAT_cattle_dairy, Element)
Year_num <- distinct(FAOSTAT_cattle_dairy, Year)
Flag_num <- distinct(FAOSTAT_cattle_dairy, Flag)
Distinct_count <- c(nrow(Area_num), nrow(Element_num), nrow(Year_num), nrow(Flag_num))
Distinct_count
```
We can observe that data set has the data from 1961-2018 on 3 elements at 232 different areas.
I am interested in finding in more information about Flag and how it corresponds to the description they have given and can we group the values by using it.
```{r}
#| label : Flag information
FLag_info <- FAOSTAT_cattle_dairy %>%
select(Flag,`Flag Description`)
unique(FLag_info)
```
From the above observation we got to know there are 7 types of flags and it is based on data provided and type of data and values they are and how is has been taken.
## Provide Grouped Summary Statistics
#### Area and Element wise analysis
```{r}
AE_analysis <- FAOSTAT_cattle_dairy %>%
group_by(Area, Element) %>%
summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE))
AE_analysis
```
As per the analysis, in each area the we have three kinds of elements and there are a lot of variations and mean value is not dependent to element itself in any area but it differs in each area and we can get a deeper understanding when we analyse the area and element separately.
#### Year and Element wise analysis
```{r}
YE_analysis <- FAOSTAT_cattle_dairy %>%
group_by(Year, Element) %>%
summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE))
YE_analysis
```
As per the data, we can say that the elements has been increased by passing year, there are some fluctuations in between but there is an overall increase by the year.
#### Element-wise analysis
```{r}
E_analysis <- FAOSTAT_cattle_dairy %>%
group_by(Element) %>%
summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE)) %>%
arrange(desc(mean_value))
E_analysis
```
As per the above observation, the mean value of the production is maximum and yield is the minimum. It shows that the production value is more with the less yield and it leads to an overall profit.
#### Area-wise analysis
```{r}
A_analysis <- FAOSTAT_cattle_dairy %>%
group_by(Area) %>%
summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE)) %>%
arrange(desc(mean_value))
A_analysis
```
As per the analysis, Europe has the highest cattle dairy and British Virgin Islands has the minimum, we can say none.
#### Year-wise analysis
```{r}
Y_analysis <- FAOSTAT_cattle_dairy %>%
group_by(Year) %>%
summarise(mean_value=mean(Value, na.rm = TRUE), median_value=median(Value, na.rm = TRUE), sd_value=sd(Value, na.rm = TRUE), min_value=min(Value, na.rm = TRUE), max_value=max(Value, na.rm = TRUE), IQR_value=IQR(Value, na.rm = TRUE)) %>%
arrange(desc(mean_value))
Y_analysis
```
As per the data, the cattle dairy production has been increased by the passing year, there are some fluctuations in middle but the overall it is increasing, we can see from the data that the increase is nearly 85 percentage from 1961 to 2018.
### Explain and Interpret
I chose Element, Area and Year subgroups from data set. The reason for choosing them is there importance in analysis and also tried to analysis using two variables to know the dependency of values on different variables. So I have done Area and Element wise analysis, Year and Element wise analysis, Element-wise analysis, Area-wise analysis and Year-wise analysis. Conclusion to my analysis would be Europe has the highest of the cattle dairy and the production of the dairy is increasing gradually as by the year.