Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Anirudh Lakkaraju
April 30, 2023
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
# A tibble: 30,977 × 14
`Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QA Live Anim… 2 Afgh… 5112 Stocks 1057
2 QA Live Anim… 2 Afgh… 5112 Stocks 1057
3 QA Live Anim… 2 Afgh… 5112 Stocks 1057
4 QA Live Anim… 2 Afgh… 5112 Stocks 1057
5 QA Live Anim… 2 Afgh… 5112 Stocks 1057
6 QA Live Anim… 2 Afgh… 5112 Stocks 1057
7 QA Live Anim… 2 Afgh… 5112 Stocks 1057
8 QA Live Anim… 2 Afgh… 5112 Stocks 1057
9 QA Live Anim… 2 Afgh… 5112 Stocks 1057
10 QA Live Anim… 2 Afgh… 5112 Stocks 1057
# ℹ 30,967 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
# A tibble: 6 × 14
`Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QA Live Anima… 2 Afgh… 5112 Stocks 1057
2 QA Live Anima… 2 Afgh… 5112 Stocks 1057
3 QA Live Anima… 2 Afgh… 5112 Stocks 1057
4 QA Live Anima… 2 Afgh… 5112 Stocks 1057
5 QA Live Anima… 2 Afgh… 5112 Stocks 1057
6 QA Live Anima… 2 Afgh… 5112 Stocks 1057
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
# A tibble: 6 × 14
`Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QA Live Anima… 5504 Poly… 5112 Stocks 1068
2 QA Live Anima… 5504 Poly… 5112 Stocks 1068
3 QA Live Anima… 5504 Poly… 5112 Stocks 1068
4 QA Live Anima… 5504 Poly… 5112 Stocks 1068
5 QA Live Anima… 5504 Poly… 5112 Stocks 1068
6 QA Live Anima… 5504 Poly… 5112 Stocks 1068
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
Domain Code Domain Area Code Area
Length:30977 Length:30977 Min. : 1 Length:30977
Class :character Class :character 1st Qu.: 79 Class :character
Mode :character Mode :character Median : 156 Mode :character
Mean :1202
3rd Qu.: 231
Max. :5504
Element Code Element Item Code Item
Min. :5112 Length:30977 Min. :1057 Length:30977
1st Qu.:5112 Class :character 1st Qu.:1057 Class :character
Median :5112 Mode :character Median :1068 Mode :character
Mean :5112 Mean :1066
3rd Qu.:5112 3rd Qu.:1072
Max. :5112 Max. :1083
Year Code Year Unit Value
Min. :1961 Min. :1961 Length:30977 Min. : 0
1st Qu.:1976 1st Qu.:1976 Class :character 1st Qu.: 171
Median :1992 Median :1992 Mode :character Median : 1800
Mean :1991 Mean :1991 Mean : 99411
3rd Qu.:2005 3rd Qu.:2005 3rd Qu.: 15404
Max. :2018 Max. :2018 Max. :23707134
NA's :1036
Flag Flag Description
Length:30977 Length:30977
Class :character Class :character
Mode :character Mode :character
spc_tbl_ [30,977 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Domain Code : chr [1:30977] "QA" "QA" "QA" "QA" ...
$ Domain : chr [1:30977] "Live Animals" "Live Animals" "Live Animals" "Live Animals" ...
$ Area Code : num [1:30977] 2 2 2 2 2 2 2 2 2 2 ...
$ Area : chr [1:30977] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Element Code : num [1:30977] 5112 5112 5112 5112 5112 ...
$ Element : chr [1:30977] "Stocks" "Stocks" "Stocks" "Stocks" ...
$ Item Code : num [1:30977] 1057 1057 1057 1057 1057 ...
$ Item : chr [1:30977] "Chickens" "Chickens" "Chickens" "Chickens" ...
$ Year Code : num [1:30977] 1961 1962 1963 1964 1965 ...
$ Year : num [1:30977] 1961 1962 1963 1964 1965 ...
$ Unit : chr [1:30977] "1000 Head" "1000 Head" "1000 Head" "1000 Head" ...
$ Value : num [1:30977] 4700 4900 5000 5300 5500 5800 6600 6290 6300 6000 ...
$ Flag : chr [1:30977] "F" "F" "F" "F" ...
$ Flag Description: chr [1:30977] "FAO estimate" "FAO estimate" "FAO estimate" "FAO estimate" ...
- attr(*, "spec")=
.. cols(
.. `Domain Code` = col_character(),
.. Domain = col_character(),
.. `Area Code` = col_double(),
.. Area = col_character(),
.. `Element Code` = col_double(),
.. Element = col_character(),
.. `Item Code` = col_double(),
.. Item = col_character(),
.. `Year Code` = col_double(),
.. Year = col_double(),
.. Unit = col_character(),
.. Value = col_double(),
.. Flag = col_character(),
.. `Flag Description` = col_character()
.. )
- attr(*, "problems")=<externalptr>
We clean all variables with the word "code" in the name as these variables are only unique identifiers for other variables.
# A tibble: 6 × 9
Domain Area Element Item Year Unit Value Flag `Flag Description`
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
1 Live Animals Afghani… Stocks Chic… 1961 1000… 4700 F FAO estimate
2 Live Animals Afghani… Stocks Chic… 1962 1000… 4900 F FAO estimate
3 Live Animals Afghani… Stocks Chic… 1963 1000… 5000 F FAO estimate
4 Live Animals Afghani… Stocks Chic… 1964 1000… 5300 F FAO estimate
5 Live Animals Afghani… Stocks Chic… 1965 1000… 5500 F FAO estimate
6 Live Animals Afghani… Stocks Chic… 1966 1000… 5800 F FAO estimate
We now try to describe the data and further filter the data if the necessary.
We first find the number of unique values in each variable
As we can see that Domain, Element and Unit have only 1 unique value, they can also be dropped
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
# A tibble: 5 × 2
Item avg_stocks
<chr> <dbl>
1 Chickens 207931.
2 Ducks 23072.
3 Geese and guinea fowls 10292.
4 Pigeons, other birds 6163.
5 Turkeys 15228.
We try to understand the sizes of the five livestock in the datset and which countries have how much poultry during the given time frame. By grouping in the following way, it can be seen that the countries have a lot more chicken as livestock compared to the other poultry.
# A tibble: 248 × 2
Area avg_stocks
<chr> <dbl>
1 Afghanistan 8099.
2 Africa 196561.
3 Albania 2278.
4 Algeria 17621.
5 American Samoa 41.4
6 Americas 856356.
7 Angola 9453.
8 Antigua and Barbuda 93.6
9 Argentina 18844.
10 Armenia 2062.
# ℹ 238 more rows
---
title: "Challenge 2"
author: "Anirudh Lakkaraju"
description: "Data wrangling: using group() and summarise()"
date: "04/30/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- Anirudh Lakkaraju
- birds.csv
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
```{r}
df <- read_csv("_data/birds.csv")
df
```
```{r}
head(df)
```
```{r}
tail(df)
```
```{r}
nrow(df)
```
```{r}
summary(df)
```
```{r}
str(df)
```
### Cleaning the data
We clean all variables with the word \"code\" in the name as these variables are only unique identifiers for other variables.
```{r}
df_cleaned<-df%>%select(-c(contains("Code")))
head(df_cleaned)
```
## Describe the data
We now try to describe the data and further filter the data if the necessary.
We first find the number of unique values in each variable
```{r}
length(unique(df_cleaned$Domain))
```
```{r}
length(unique(df_cleaned$Area))
```
```{r}
length(unique(df_cleaned$Element))
```
```{r}
length(unique(df_cleaned$Item))
```
```{r}
length(unique(df_cleaned$Year))
```
```{r}
length(unique(df_cleaned$Unit))
```
```{r}
length(unique(df_cleaned$Value))
```
```{r}
length(unique(df_cleaned$Flag))
```
```{r}
length(unique(df_cleaned$`Flag Description`))
```
As we can see that Domain, Element and Unit have only 1 unique value, they can also be dropped
```{r}
df_cleaned<-df_cleaned%>%select(-c(Domain, Element, Unit))
unique(df_cleaned$Item)
```
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
```{r}
df_cleaned %>%
group_by(Item)%>%
summarise(avg_stocks = mean(Value, na.rm = TRUE))
```
### Explain and Interpret
We try to understand the sizes of the five livestock in the datset and which countries have how much poultry during the given time frame. By grouping in the following way, it can be seen that the countries have a lot more chicken as livestock compared to the other poultry.
```{r}
df_cleaned %>%
group_by(Area)%>%
summarize(avg_stocks = mean(Value, na.rm = TRUE))
```