Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
FNU Avinesh Krishnan
May 15, 2023
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
The dataset consists of information related to live animals in different countries. It includes the following cases and variables:
Each row represents a specific case, providing details about live animal stocks in a particular country for a specific year. Variables: Domain Code: Code representing the domain of the data (e.g., QA for live animals). Domain: Domain of the data (e.g., Live Animals). Area Code: Code representing the specific area or country (e.g., 2 for Afghanistan). Area: Name of the area or country (e.g., Afghanistan). Element Code: Code representing the element of the data (e.g., 5111 for stocks). Element: Description of the element (e.g., Stocks). Item Code: Code representing the specific item or category (e.g., 1107 for asses). Item: Description of the item (e.g., Asses). Year Code: Code representing the year of the data (e.g., 1961). Year: Year of the data (e.g., 1961). Unit: Unit of measurement (e.g., Head). Value: Numeric value representing the quantity of live animals. Flag: Flag indicating additional information or data status (e.g., NA for missing data). Flag Description: Description of the flag (e.g., Official data or FAO estimate). The data appears to be gathered from official sources, likely collected by the Food and Agriculture Organization (FAO) or a related organization. It provides insights into the stock levels of various live animals in different countries over multiple years.
# A tibble: 6 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1961 1961 Head
2 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1962 1962 Head
3 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1963 1963 Head
4 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1964 1964 Head
5 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1965 1965 Head
6 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1966 1966 Head
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
spc_tbl_ [82,116 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Domain Code : chr [1:82116] "QA" "QA" "QA" "QA" ...
$ Domain : chr [1:82116] "Live Animals" "Live Animals" "Live Animals" "Live Animals" ...
$ Area Code : num [1:82116] 2 2 2 2 2 2 2 2 2 2 ...
$ Area : chr [1:82116] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Element Code : num [1:82116] 5111 5111 5111 5111 5111 ...
$ Element : chr [1:82116] "Stocks" "Stocks" "Stocks" "Stocks" ...
$ Item Code : num [1:82116] 1107 1107 1107 1107 1107 ...
$ Item : chr [1:82116] "Asses" "Asses" "Asses" "Asses" ...
$ Year Code : num [1:82116] 1961 1962 1963 1964 1965 ...
$ Year : num [1:82116] 1961 1962 1963 1964 1965 ...
$ Unit : chr [1:82116] "Head" "Head" "Head" "Head" ...
$ Value : num [1:82116] 1300000 851850 1001112 1150000 1300000 ...
$ Flag : chr [1:82116] NA NA NA "F" ...
$ Flag Description: chr [1:82116] "Official data" "Official data" "Official data" "FAO estimate" ...
- attr(*, "spec")=
.. cols(
.. `Domain Code` = col_character(),
.. Domain = col_character(),
.. `Area Code` = col_double(),
.. Area = col_character(),
.. `Element Code` = col_double(),
.. Element = col_character(),
.. `Item Code` = col_double(),
.. Item = col_character(),
.. `Year Code` = col_double(),
.. Year = col_double(),
.. Unit = col_character(),
.. Value = col_double(),
.. Flag = col_character(),
.. `Flag Description` = col_character()
.. )
- attr(*, "problems")=<externalptr>
[1] "Domain Code" "Domain" "Area Code" "Area"
[5] "Element Code" "Element" "Item Code" "Item"
[9] "Year Code" "Year" "Unit" "Value"
[13] "Flag" "Flag Description"
# A tibble: 1 × 1
Domain
<chr>
1 Live Animals
# A tibble: 1 × 1
`Element Code`
<dbl>
1 5111
Domain Code Domain Area Code Area
Length:82116 Length:82116 Min. : 1.0 Length:82116
Class :character Class :character 1st Qu.: 73.0 Class :character
Mode :character Mode :character Median : 146.0 Mode :character
Mean : 912.7
3rd Qu.: 221.0
Max. :5504.0
Element Code Element Item Code Item
Min. :5111 Length:82116 Min. : 866 Length:82116
1st Qu.:5111 Class :character 1st Qu.: 976 Class :character
Median :5111 Mode :character Median :1034 Mode :character
Mean :5111 Mean :1018
3rd Qu.:5111 3rd Qu.:1096
Max. :5111 Max. :1126
Year Code Year Unit Value
Min. :1961 Min. :1961 Length:82116 Min. :0.000e+00
1st Qu.:1976 1st Qu.:1976 Class :character 1st Qu.:1.250e+04
Median :1991 Median :1991 Mode :character Median :2.247e+05
Mean :1990 Mean :1990 Mean :1.163e+07
3rd Qu.:2005 3rd Qu.:2005 3rd Qu.:2.377e+06
Max. :2018 Max. :2018 Max. :1.490e+09
NA's :1301
Flag Flag Description
Length:82116 Length:82116
Class :character Class :character
Mode :character Mode :character
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
# A tibble: 253 × 4
Area averge_value Max_value Min_value
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 3597216. 21500000 20000
2 Africa 78159910. 438110974 706338
3 Albania NA NA NA
4 Algeria 2575444. 28693330 2000
5 American Samoa 5399. 14600 88
6 Americas 95795716. 522867113 66000
7 Angola 1123117. 5044019 800
8 Antigua and Barbuda 7877. 36000 400
9 Argentina 12879607. 61053808 85198
10 Armenia 167587. 1000000 25
# … with 243 more rows
# A tibble: 2 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QA Live … 2 Afgh… 5111 Stocks 1126 Came… 1961 1961 Head
2 QA Live … 5100 Afri… 5111 Stocks 1126 Came… 1961 1961 Head
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
There are two data points available for the above filter i.e for two areas.
Summary_Year <- na.omit(livestockdata) %>%
group_by(Year) %>%
summarize(
mean_Value = mean(Value, na.rm = TRUE),
median_Value = median(Value, na.rm = TRUE),
min_Value = min(Value, na.rm = TRUE),
max_Value = max(Value, na.rm = TRUE),
sd_Value = sd(Value, na.rm = TRUE),
var_Value = var(Value, na.rm = TRUE),
IQR_Value = IQR(Value, na.rm = TRUE)
)
Summary_Year
# A tibble: 58 × 8
Year mean_Value median_Value min_Value max_Value sd_Value var_Va…¹ IQR_V…²
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1961 14080955. 129000 10 994268736 66803024. 4.46e15 2.14e6
2 1962 14085663. 130000 10 997193122 66008462. 4.36e15 2.01e6
3 1963 14568714. 154000 10 999696719 67268611. 4.53e15 2.49e6
4 1964 15101322. 160000 10 1013486149 68974136. 4.76e15 2.61e6
5 1965 15558689. 140000 10 1030878735 70820667. 5.02e15 2.75e6
6 1966 15481029. 155000 10 1040688491 72204592. 5.21e15 2.42e6
7 1967 15804831. 148000 10 1059154631 72456707. 5.25e15 2.54e6
8 1968 15526705. 105850 10 1074022618 72343610. 5.23e15 2.25e6
9 1969 15203756. 116412. 12 1080720926 71629580. 5.13e15 2.15e6
10 1970 15475993. 103000 12 1081641464 72217288. 5.22e15 2.24e6
# … with 48 more rows, and abbreviated variable names ¹var_Value, ²IQR_Value
Summary_Area <- na.omit(livestockdata) %>%
group_by(Area) %>%
summarize(
mean_Value = mean(Value, na.rm = TRUE),
median_Value = median(Value, na.rm = TRUE),
min_Value = min(Value, na.rm = TRUE),
max_Value = max(Value, na.rm = TRUE),
sd_Value = sd(Value, na.rm = TRUE),
var_Value = var(Value, na.rm = TRUE),
IQR_Value = IQR(Value, na.rm = TRUE)
)
Summary_Area
# A tibble: 240 × 8
Area mean_Va…¹ media…² min_V…³ max_V…⁴ sd_Va…⁵ var_V…⁶ IQR_V…⁷
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 2571604. 4.1 e5 20000 1.96e7 4.69e6 2.20e13 2.46e6
2 Africa 78159910. 1.45e7 706338 4.38e8 1.08e8 1.17e16 1.49e8
3 Albania 181722. 5.5 e4 93 1.35e6 3.24e5 1.05e11 9.35e4
4 Algeria 22361. 5 e3 2000 7 e5 9.26e4 8.58e 9 1.5 e3
5 American Samoa 5858. 9.6 e3 95 1.46e4 5.35e3 2.86e 7 1.04e4
6 Americas 95795716. 3.22e7 66000 5.23e8 1.39e8 1.93e16 1.26e8
7 Angola 867412. 2.4 e5 800 5.04e6 1.28e6 1.63e12 1.17e6
8 Antigua and Barbuda 7426. 3 e3 400 3.6 e4 8.42e3 7.09e 7 1.1 e4
9 Argentina 3685037. 2.4 e6 88000 4.9 e7 8.14e6 6.63e13 3.52e6
10 Armenia 330230. 3.83e4 6415 1 e6 4.11e5 1.69e11 5.95e5
# … with 230 more rows, and abbreviated variable names ¹mean_Value,
# ²median_Value, ³min_Value, ⁴max_Value, ⁵sd_Value, ⁶var_Value, ⁷IQR_Value
The data spans a period from 1961 to 2018, indicating a long-term collection of data. The Element column contains only one distinct value, which is “Stocks”. Similarly, the Element Code also has a single value, which is 5111. This suggests that the dataset focuses specifically on the stocks element. The Area Code ranges from 1 to 5504, indicating a wide range of areas or regions included in the dataset. The Value column exhibits a significant amount of variance, suggesting diverse numerical values for the specific element being measured.
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
---
title: "Challenge 2"
author: "FNU Avinesh Krishnan"
description: "Data wrangling: using group() and summarise()"
date: "05/15/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- FNU Avinesh Krishnan
- FAOSTAT_livestock
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
```{r}
livestockdata <-read_csv("~/Desktop/601_Spring_2023/posts/_data/FAOSTAT_livestock.csv")
view(livestockdata)
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
The dataset consists of information related to live animals in different countries. It includes the following cases and variables:
Each row represents a specific case, providing details about live animal stocks in a particular country for a specific year.
Variables:
Domain Code: Code representing the domain of the data (e.g., QA for live animals).
Domain: Domain of the data (e.g., Live Animals).
Area Code: Code representing the specific area or country (e.g., 2 for Afghanistan).
Area: Name of the area or country (e.g., Afghanistan).
Element Code: Code representing the element of the data (e.g., 5111 for stocks).
Element: Description of the element (e.g., Stocks).
Item Code: Code representing the specific item or category (e.g., 1107 for asses).
Item: Description of the item (e.g., Asses).
Year Code: Code representing the year of the data (e.g., 1961).
Year: Year of the data (e.g., 1961).
Unit: Unit of measurement (e.g., Head).
Value: Numeric value representing the quantity of live animals.
Flag: Flag indicating additional information or data status (e.g., NA for missing data).
Flag Description: Description of the flag (e.g., Official data or FAO estimate).
The data appears to be gathered from official sources, likely collected by the Food and Agriculture Organization (FAO) or a related organization. It provides insights into the stock levels of various live animals in different countries over multiple years.
```{r}
#| label: summary
head(livestockdata)
```
```{r}
str(livestockdata)
```
```{r}
colnames(livestockdata)
```
```{r}
livestockdata%>%
select("Domain") %>%
distinct(.)
```
```{r}
livestockdata%>%
select("Element Code") %>%
distinct(.)
```
```{r}
dim(livestockdata)
```
```{r}
min(livestockdata$Year)
max(livestockdata$Year)
```
```{r}
min(livestockdata$`Area Code`)
max(livestockdata$`Area Code`)
```
```{r}
summary(livestockdata)
```
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
```{r}
livestockdata %>%
group_by(Area) %>%
summarise(averge_value = mean(Value), Max_value=max(Value), Min_value=min(Value))
```
```{r}
livestockdata %>%
filter(Area == "Africa" | Area == "Afghanistan",Item == "Camels", Year == 1961)
```
There are two data points available for the above filter i.e for two areas.
```{r}
Summary_Year <- na.omit(livestockdata) %>%
group_by(Year) %>%
summarize(
mean_Value = mean(Value, na.rm = TRUE),
median_Value = median(Value, na.rm = TRUE),
min_Value = min(Value, na.rm = TRUE),
max_Value = max(Value, na.rm = TRUE),
sd_Value = sd(Value, na.rm = TRUE),
var_Value = var(Value, na.rm = TRUE),
IQR_Value = IQR(Value, na.rm = TRUE)
)
Summary_Year
```
```{r}
Summary_Area <- na.omit(livestockdata) %>%
group_by(Area) %>%
summarize(
mean_Value = mean(Value, na.rm = TRUE),
median_Value = median(Value, na.rm = TRUE),
min_Value = min(Value, na.rm = TRUE),
max_Value = max(Value, na.rm = TRUE),
sd_Value = sd(Value, na.rm = TRUE),
var_Value = var(Value, na.rm = TRUE),
IQR_Value = IQR(Value, na.rm = TRUE)
)
Summary_Area
```
The data spans a period from 1961 to 2018, indicating a long-term collection of data. The Element column contains only one distinct value, which is "Stocks". Similarly, the Element Code also has a single value, which is 5111. This suggests that the dataset focuses specifically on the stocks element. The Area Code ranges from 1 to 5504, indicating a wide range of areas or regions included in the dataset. The Value column exhibits a significant amount of variance, suggesting diverse numerical values for the specific element being measured.
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.