Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Shantanu Patil
March 1, 2023
I have selected faostat_cattle_diary dataset to do my analysis on.
# A tibble: 3 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QL Lives… 2 Afgh… 5318 Milk A… 882 Milk… 1961 1961 Head
2 QL Lives… 2 Afgh… 5420 Yield 882 Milk… 1961 1961 hg/An
3 QL Lives… 2 Afgh… 5510 Produc… 882 Milk… 1961 1961 tonn…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
Using R commands I am exploring the data and coming up with the description from the data. The dataset contains 14 column, 8 of which have character datatype and 6 have integer data type. Here are the Columns
chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value
Here is the size of the Dataset — Rows: 36449 Columns: 14
Domain Code Domain Area Code Area
Length:36449 Length:36449 Min. : 1.0 Length:36449
Class :character Class :character 1st Qu.: 69.0 Class :character
Mode :character Mode :character Median : 141.0 Mode :character
Mean : 775.2
3rd Qu.: 215.0
Max. :5504.0
Element Code Element Item Code Item
Min. :5318 Length:36449 Min. :882 Length:36449
1st Qu.:5318 Class :character 1st Qu.:882 Class :character
Median :5420 Mode :character Median :882 Mode :character
Mean :5416 Mean :882
3rd Qu.:5510 3rd Qu.:882
Max. :5510 Max. :882
Year Code Year Unit Value
Min. :1961 Min. :1961 Length:36449 Min. : 7
1st Qu.:1976 1st Qu.:1976 Class :character 1st Qu.: 7849
Median :1991 Median :1991 Mode :character Median : 43266
Mean :1990 Mean :1990 Mean : 4410235
3rd Qu.:2005 3rd Qu.:2005 3rd Qu.: 700000
Max. :2018 Max. :2018 Max. :683217055
NA's :74
Flag Flag Description
Length:36449 Length:36449
Class :character Class :character
Mode :character Mode :character
Conducting exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
.
Remove columns with “Item Code” values as it has constant value 882
Grouping by the Area and finding the mean_value
# A tibble: 232 × 2
Area mean_value
<chr> <dbl>
1 Afghanistan 903662.
2 Africa 18861019.
3 Albania 275446.
4 Algeria 599219.
5 American Samoa 3611.
6 Americas 56060267.
7 Angola 164004.
8 Antigua and Barbuda 6402.
9 Argentina 3279105.
10 Armenia 276119.
# … with 222 more rows
Table output by selecting Columns “Area”, “Year”, and “Value”.
# A tibble: 36,449 × 3
Area Year Value
<chr> <dbl> <dbl>
1 Afghanistan 1961 700000
2 Afghanistan 1961 5000
3 Afghanistan 1961 350000
4 Afghanistan 1962 700000
5 Afghanistan 1962 5000
6 Afghanistan 1962 350000
7 Afghanistan 1963 780000
8 Afghanistan 1963 5128
9 Afghanistan 1963 400000
10 Afghanistan 1964 780000
# … with 36,439 more rows
I wanted to check out how any data rows / points are available for Antigua and Barbuda in the year of 1975. There are 3 datapoints available.
# A tibble: 3 × 13
Domain …¹ Domain Area …² Area Eleme…³ Element Item Year …⁴ Year Unit Value
<chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 QL Lives… 8 Anti… 5318 Milk A… Milk… 1975 1975 Head 3100
2 QL Lives… 8 Anti… 5420 Yield Milk… 1975 1975 hg/An 10645
3 QL Lives… 8 Anti… 5510 Produc… Milk… 1975 1975 tonn… 3300
# … with 2 more variables: Flag <chr>, `Flag Description` <chr>, and
# abbreviated variable names ¹`Domain Code`, ²`Area Code`, ³`Element Code`,
# ⁴`Year Code`
Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for Value within the data set.
#deleting the rows with Null values
data2 <- na.omit(data)
# Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
data2%>%
group_by(Area)%>%
summarize(mean.Value = mean(Value, na.rm = TRUE), median.Value = median(Value, na.rm = TRUE), min.Value = min(Value, na.rm = TRUE), max.Value = max(Value, na.rm = TRUE), sd.Value = sd(Value, na.rm = TRUE), var.Value = var(Value, na.rm = TRUE), IQR.Value = IQR(Value, na.rm = TRUE))
# A tibble: 231 × 8
Area mean.Va…¹ media…² min.V…³ max.V…⁴ sd.Va…⁵ var.V…⁶ IQR.V…⁷
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 873703. 5.56e5 3592 4.05e6 1.07e6 1.14e12 1.19e6
2 Africa 18861019. 1.51e7 4311 7.13e7 1.93e7 3.73e14 3.11e7
3 Albania 88480. 2.53e4 6350 3.24e5 9.90e4 9.81e 9 1.52e5
4 Algeria 277666. 1.92e5 6008 1.09e6 2.92e5 8.52e10 5.28e5
5 American Samoa 3652. 3.15e1 16 1.38e4 5.37e3 2.88e 7 7.98e3
6 Americas 56060267. 4.51e7 22342 1.84e8 5.50e7 3.03e15 9.12e7
7 Angola 164317. 1.57e5 4800 4.36e5 1.36e5 1.85e10 2.8 e5
8 Antigua and Barbuda 6414. 6.2 e3 2300 1.06e4 2.67e3 7.15e 6 5.64e3
9 Argentina 1242484. 5.80e4 16980 8.1 e6 1.40e6 1.95e12 2.37e6
10 Armenia 47702. 1.95e4 14937 4.44e5 1.08e5 1.18e10 3.45e3
# … with 221 more rows, and abbreviated variable names ¹mean.Value,
# ²median.Value, ³min.Value, ⁴max.Value, ⁵sd.Value, ⁶var.Value, ⁷IQR.Value
I wanted to check out how any data rows / points are available for Antigua and Barbuda in the year of 1975. There are 3 datapoints available. Some Conclusion from seeing the data summary
I deleted the Item Code Column.
I choose Area (Country) as a group because I wanted to see how does the Value of output of cattle and diary farms vary based on teh country they are located in. I think every country has a set of climatic condition, thus the value in tropical countries might be far greater than countries near north pole, I wanted to test this hypothesis.
I found out the relationship is not linear, on seeing the data I realized that some countries might be the around the same distance from equator or have same climatic conditions but still the mean value out put varies. Ex Afghanistan and Africa may have somewhat same climatic conditions but as afganisatn is located i the mountains the mean Value is much lower compared to Africa.
---
title: "Challenge 2: Finding the Cattle Dairy Value statistics for each country from 1961 - 2018"
author: "Shantanu Patil"
description: "Data wrangling and Data Sumarization to find statistics in data"
date: "03/01/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- faostat
- faostat_cattle_diary
- Shantanu Patil
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Data
I have selected faostat_cattle_diary dataset to do my analysis on.
```{r}
# Importing the faostat_cattle_diary dataset and reading them
library(readr)
data <-read_csv("_data/FAOSTAT_cattle_dairy.csv")
#Lets see the first 3 rows of the dataset
head(data, 3)
```
## Describe the data
Using R commands I am exploring the data and coming up with the description from the data.
The dataset contains 14 column, 8 of which have character datatype and 6 have integer data type.
Here are the Columns
chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value
Here is the size of the Dataset --- Rows: 36449 Columns: 14
### Checking the variation in data for each column
```{r}
#| label: summary
# Check the variation in data for each column
summary(data)
```
## Provide Grouped Summary Statistics
Conducting exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`.
Remove columns with "Item Code" values as it has constant value 882
```{r}
#install.packages("dplyr")
library(dplyr)
#remove columns with "Item Code" values as it has contant value 882
data<-data%>%
select(-contains("Item Code"))
```
Grouping by the Area and finding the mean_value
```{r}
data %>%
group_by(Area) %>%
summarise(mean_value = mean(Value))
```
Table output by selecting Columns "Area", "Year", and "Value".
```{r}
data %>%
select(Area, Year, Value)
```
I wanted to check out how any data rows / points are available for Antigua and Barbuda in the year of 1975. There are 3 datapoints available.
```{r}
data %>%
filter(Area == "Antigua and Barbuda" , Year == 1975)
```
## Table Value Statistics
Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for Value within the data set.
```{r}
#deleting the rows with Null values
data2 <- na.omit(data)
# Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
data2%>%
group_by(Area)%>%
summarize(mean.Value = mean(Value, na.rm = TRUE), median.Value = median(Value, na.rm = TRUE), min.Value = min(Value, na.rm = TRUE), max.Value = max(Value, na.rm = TRUE), sd.Value = sd(Value, na.rm = TRUE), var.Value = var(Value, na.rm = TRUE), IQR.Value = IQR(Value, na.rm = TRUE))
```
I wanted to check out how any data rows / points are available for Antigua and Barbuda in the year of 1975. There are 3 datapoints available.
Some Conclusion from seeing the data summary
1. The Data is Collected from the year 1961 - 2018
2. The Item Code column just has one value 882
3. The Area Code varies from 1 - 5504
4. The Element Code varies frojm 5318 - 5510
5. The Value Column has a lot of variance
I deleted the Item Code Column.
I choose Area (Country) as a group because I wanted to see how does the Value of output of cattle and diary farms vary based on teh country they are located in. I think every country has a set of climatic condition, thus the value in tropical countries might be far greater than countries near north pole, I wanted to test this hypothesis.
I found out the relationship is not linear, on seeing the data I realized that some countries might be the around the same distance from equator or have same climatic conditions but still the mean value out put varies. Ex Afghanistan and Africa may have somewhat same climatic conditions but as afganisatn is located i the mountains the mean Value is much lower compared to Africa.