Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Jaswanth Reddy Kommuru
May 8, 2023
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
The data appears to be related to live animal stocks, specifically chickens, in various countries over multiple years. Here’s a high-level description of the data:
The dataset consists of cases representing different observations and variables providing specific information about those observations. Each case represents a particular combination of country, year, and other attributes. The variables in the dataset are as follows:
Domain Code: Code indicating the domain or category of the data. Domain: Descriptive label for the domain. Area Code: Code representing the country or area. Area: Name of the country or area. Element Code: Code indicating the element or attribute being measured. Element: Description of the element or attribute. Item Code: Code representing the specific item or category being measured. Item: Description of the item or category. Year Code: Code representing the year of observation. Year: Specific year of observation. Unit: Unit of measurement for the values. Value: Numerical value or measurement for the specific observation. Flag: Flag indicating any additional information or peculiarities about the observation. Flag Description: Description providing further details about the flag.
It seems that the data was likely gathered by the Food and Agriculture Organization (FAO) as estimates of live animal stocks, focusing on chickens. The observations include the country, year, and the stock of chickens measured in units of 1000 head. The dataset also includes flags that might indicate specific circumstances or notes related to the observations.
# A tibble: 6 × 14
`Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QA Live Anima… 2 Afgh… 5112 Stocks 1057
2 QA Live Anima… 2 Afgh… 5112 Stocks 1057
3 QA Live Anima… 2 Afgh… 5112 Stocks 1057
4 QA Live Anima… 2 Afgh… 5112 Stocks 1057
5 QA Live Anima… 2 Afgh… 5112 Stocks 1057
6 QA Live Anima… 2 Afgh… 5112 Stocks 1057
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
# A tibble: 6 × 14
`Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QA Live Anima… 5504 Poly… 5112 Stocks 1068
2 QA Live Anima… 5504 Poly… 5112 Stocks 1068
3 QA Live Anima… 5504 Poly… 5112 Stocks 1068
4 QA Live Anima… 5504 Poly… 5112 Stocks 1068
5 QA Live Anima… 5504 Poly… 5112 Stocks 1068
6 QA Live Anima… 5504 Poly… 5112 Stocks 1068
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
[1] "Domain Code" "Domain" "Area Code" "Area"
[5] "Element Code" "Element" "Item Code" "Item"
[9] "Year Code" "Year" "Unit" "Value"
[13] "Flag" "Flag Description"
spc_tbl_ [30,977 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Domain Code : chr [1:30977] "QA" "QA" "QA" "QA" ...
$ Domain : chr [1:30977] "Live Animals" "Live Animals" "Live Animals" "Live Animals" ...
$ Area Code : num [1:30977] 2 2 2 2 2 2 2 2 2 2 ...
$ Area : chr [1:30977] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Element Code : num [1:30977] 5112 5112 5112 5112 5112 ...
$ Element : chr [1:30977] "Stocks" "Stocks" "Stocks" "Stocks" ...
$ Item Code : num [1:30977] 1057 1057 1057 1057 1057 ...
$ Item : chr [1:30977] "Chickens" "Chickens" "Chickens" "Chickens" ...
$ Year Code : num [1:30977] 1961 1962 1963 1964 1965 ...
$ Year : num [1:30977] 1961 1962 1963 1964 1965 ...
$ Unit : chr [1:30977] "1000 Head" "1000 Head" "1000 Head" "1000 Head" ...
$ Value : num [1:30977] 4700 4900 5000 5300 5500 5800 6600 6290 6300 6000 ...
$ Flag : chr [1:30977] "F" "F" "F" "F" ...
$ Flag Description: chr [1:30977] "FAO estimate" "FAO estimate" "FAO estimate" "FAO estimate" ...
- attr(*, "spec")=
.. cols(
.. `Domain Code` = col_character(),
.. Domain = col_character(),
.. `Area Code` = col_double(),
.. Area = col_character(),
.. `Element Code` = col_double(),
.. Element = col_character(),
.. `Item Code` = col_double(),
.. Item = col_character(),
.. `Year Code` = col_double(),
.. Year = col_double(),
.. Unit = col_character(),
.. Value = col_double(),
.. Flag = col_character(),
.. `Flag Description` = col_character()
.. )
- attr(*, "problems")=<externalptr>
Domain Code Domain Area Code Area
Length:30977 Length:30977 Min. : 1 Length:30977
Class :character Class :character 1st Qu.: 79 Class :character
Mode :character Mode :character Median : 156 Mode :character
Mean :1202
3rd Qu.: 231
Max. :5504
Element Code Element Item Code Item
Min. :5112 Length:30977 Min. :1057 Length:30977
1st Qu.:5112 Class :character 1st Qu.:1057 Class :character
Median :5112 Mode :character Median :1068 Mode :character
Mean :5112 Mean :1066
3rd Qu.:5112 3rd Qu.:1072
Max. :5112 Max. :1083
Year Code Year Unit Value
Min. :1961 Min. :1961 Length:30977 Min. : 0
1st Qu.:1976 1st Qu.:1976 Class :character 1st Qu.: 171
Median :1992 Median :1992 Mode :character Median : 1800
Mean :1991 Mean :1991 Mean : 99411
3rd Qu.:2005 3rd Qu.:2005 3rd Qu.: 15404
Max. :2018 Max. :2018 Max. :23707134
NA's :1036
Flag Flag Description
Length:30977 Length:30977
Class :character Class :character
Mode :character Mode :character
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
# A tibble: 58 × 2
Year Total_sum
<dbl> <dbl>
1 2018 102772478
2 2017 100702358
3 2016 99205812
4 2015 94430088
5 2014 92055902
6 2013 91572785
7 2012 89857937
8 2010 89244853
9 2011 87518486
10 2009 86817163
# ℹ 48 more rows
Population of the live animals in the available data of all the countries in different years and we can observe from the above table that as the years are increasing generally the count of the live animals is also increasing.
# A tibble: 5 × 3
Item average_value Medain_value
<chr> <dbl> <dbl>
1 Chickens 207931. 10784.
2 Ducks 23072. 510
3 Geese and guinea fowls 10292. 258
4 Pigeons, other birds 6163. 2800
5 Turkeys 15228. 528
Item_group <-
birdsdata %>%
group_by(`Item`) %>%
select(Item,Value) %>%
summarise(mean_stock_value = mean(Value, na.rm=TRUE),
median_stock_value = median(Value, na.rm=TRUE),
stock_value_sd = sd(Value, na.rm=TRUE),
min_stock_value = min(Value, na.rm=TRUE),
max_stock_value = max(Value, na.rm=TRUE),
first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
Item_group
# A tibble: 5 × 8
Item mean_stock_value median_stock_value stock_value_sd min_stock_value
<chr> <dbl> <dbl> <dbl> <dbl>
1 Chickens 207931. 10784. 1081629. 0
2 Ducks 23072. 510 110621. 0
3 Geese and … 10292. 258 44489. 0
4 Pigeons, o… 6163. 2800 8481. 0
5 Turkeys 15228. 528 56416. 0
# ℹ 3 more variables: max_stock_value <dbl>, first_quartile_stock_value <dbl>,
# third_quartile_stock_value <dbl>
Area_group <-
birdsdata %>%
group_by(`Area`) %>%
summarise(mean_stock_value = mean(Value, na.rm=TRUE),
median_stock_value = median(Value, na.rm=TRUE),
stock_value_sd = sd(Value, na.rm=TRUE),
min_stock_value = min(Value, na.rm=TRUE),
max_stock_value = max(Value, na.rm=TRUE),
first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
Area_group
# A tibble: 248 × 8
Area mean_stock_value median_stock_value stock_value_sd min_stock_value
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanist… 8099. 6700 2819. 4700
2 Africa 196561. 12910. 435740. 1213
3 Albania 2278. 1300 2268. 200
4 Algeria 17621. 42.5 38830. 10
5 American … 41.4 38 14.0 24
6 Americas 856356. 66924. 1539316. 553
7 Angola 9453. 6075 8928. 3400
8 Antigua a… 93.6 85 39.7 43
9 Argentina 18844. 2355 33620. 75
10 Armenia 2062. 1528. 2044. 120
# ℹ 238 more rows
# ℹ 3 more variables: max_stock_value <dbl>, first_quartile_stock_value <dbl>,
# third_quartile_stock_value <dbl>
Year_group <- birdsdata %>%
group_by(`Year`) %>%
summarise(mean_stock_value = mean(Value, na.rm=TRUE),
median_stock_value = median(Value, na.rm=TRUE),
stock_value_sd = sd(Value, na.rm=TRUE),
min_stock_value = min(Value, na.rm=TRUE),
max_stock_value = max(Value, na.rm=TRUE),
first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
print(Year_group)
# A tibble: 58 × 8
Year mean_stock_value median_stock_value stock_value_sd min_stock_value
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1961 36752. 1033 216931. 1
2 1962 37787. 1014 224935. 1
3 1963 38736. 1106 230985. 1
4 1964 39325. 1103 234108. 0
5 1965 40334. 1104 240537. 0
6 1966 41229. 1088. 245576. 0
7 1967 43240. 1193 257592. 0
8 1968 44420. 1252. 265750. 0
9 1969 45607. 1267 273871. 0
10 1970 47706. 1259 285751. 0
# ℹ 48 more rows
# ℹ 3 more variables: max_stock_value <dbl>, first_quartile_stock_value <dbl>,
# third_quartile_stock_value <dbl>
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
Our objective was to analyze the sizes of the five livestock categories within the dataset and determine the poultry quantities in different countries within the specified time period. Through a specific grouping approach, we observed a notable disparity between the presence of chickens and other types of poultry in these countries. It became evident that the countries had significantly higher numbers of chicken livestock compared to other poultry varieties.
# A tibble: 248 × 2
Area mean_stock_value
<chr> <dbl>
1 Afghanistan 8099.
2 Africa 196561.
3 Albania 2278.
4 Algeria 17621.
5 American Samoa 41.4
6 Americas 856356.
7 Angola 9453.
8 Antigua and Barbuda 93.6
9 Argentina 18844.
10 Armenia 2062.
# ℹ 238 more rows
---
title: "Challenge 2"
author: "Jaswanth Reddy Kommuru"
description: "Data wrangling: using group() and summarise()"
date: "05/08/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- Jaswanth Reddy Kommuru
- birds
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
```{r}
birdsdata<-read_csv("~/Documents/601/601_Spring_2023/posts/_data/birds.csv")
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
The data appears to be related to live animal stocks, specifically chickens, in various countries over multiple years. Here's a high-level description of the data:
The dataset consists of cases representing different observations and variables providing specific information about those observations. Each case represents a particular combination of country, year, and other attributes. The variables in the dataset are as follows:
Domain Code: Code indicating the domain or category of the data.
Domain: Descriptive label for the domain.
Area Code: Code representing the country or area.
Area: Name of the country or area.
Element Code: Code indicating the element or attribute being measured.
Element: Description of the element or attribute.
Item Code: Code representing the specific item or category being measured.
Item: Description of the item or category.
Year Code: Code representing the year of observation.
Year: Specific year of observation.
Unit: Unit of measurement for the values.
Value: Numerical value or measurement for the specific observation.
Flag: Flag indicating any additional information or peculiarities about the observation.
Flag Description: Description providing further details about the flag.
It seems that the data was likely gathered by the Food and Agriculture Organization (FAO) as estimates of live animal stocks, focusing on chickens. The observations include the country, year, and the stock of chickens measured in units of 1000 head. The dataset also includes flags that might indicate specific circumstances or notes related to the observations.
```{r}
#| label: summary
head(birdsdata)
```
```{r}
dim(birdsdata)
```
```{r}
tail(birdsdata)
```
```{r}
colnames(birdsdata)
```
```{r}
nrow(birdsdata)
```
```{r}
str(birdsdata)
```
```{r}
summary(birdsdata)
```
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
```{r}
birdsdata%>%
select(`Area`) %>%
n_distinct(.)
```
```{r}
birdsdata%>%
select(`Year`) %>%
n_distinct(.)
```
```{r}
distinct(birdsdata,`Domain`)
```
```{r}
req_data<-select(birdsdata,Area,Item,Year,Value)
req_data %>%
group_by(Year) %>%
summarise(Total_sum=sum(Value,na.rm=TRUE)) %>%
arrange(desc(`Total_sum`))
```
Population of the live animals in the available data of all the countries in different years and we can observe from the above table that as the years are increasing generally the count of the live animals is also increasing.
```{r}
birdsdata %>%
group_by(Item)%>%
summarise(average_value = mean(Value, na.rm = TRUE), Medain_value=median(Value, na.rm = TRUE))
```
```{r}
Item_group <-
birdsdata %>%
group_by(`Item`) %>%
select(Item,Value) %>%
summarise(mean_stock_value = mean(Value, na.rm=TRUE),
median_stock_value = median(Value, na.rm=TRUE),
stock_value_sd = sd(Value, na.rm=TRUE),
min_stock_value = min(Value, na.rm=TRUE),
max_stock_value = max(Value, na.rm=TRUE),
first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
Item_group
```
```{r}
Area_group <-
birdsdata %>%
group_by(`Area`) %>%
summarise(mean_stock_value = mean(Value, na.rm=TRUE),
median_stock_value = median(Value, na.rm=TRUE),
stock_value_sd = sd(Value, na.rm=TRUE),
min_stock_value = min(Value, na.rm=TRUE),
max_stock_value = max(Value, na.rm=TRUE),
first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
Area_group
```
```{r}
Year_group <- birdsdata %>%
group_by(`Year`) %>%
summarise(mean_stock_value = mean(Value, na.rm=TRUE),
median_stock_value = median(Value, na.rm=TRUE),
stock_value_sd = sd(Value, na.rm=TRUE),
min_stock_value = min(Value, na.rm=TRUE),
max_stock_value = max(Value, na.rm=TRUE),
first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
print(Year_group)
```
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
Our objective was to analyze the sizes of the five livestock categories within the dataset and determine the poultry quantities in different countries within the specified time period. Through a specific grouping approach, we observed a notable disparity between the presence of chickens and other types of poultry in these countries. It became evident that the countries had significantly higher numbers of chicken livestock compared to other poultry varieties.
```{r}
birdsdata %>%
group_by(Area)%>%
summarize(mean_stock_value = mean(Value, na.rm = TRUE))
```