Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Nick Boonstra
August 16, 2022
This challenge involves reading in and cleaning up data from the “birds” data set.
# A tibble: 30,977 × 14
Domain Cod…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1961 1961
2 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1962 1962
3 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1963 1963
4 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1964 1964
5 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1965 1965
6 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1966 1966
7 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1967 1967
8 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1968 1968
9 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1969 1969
10 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1970 1970
# … with 30,967 more rows, 4 more variables: Unit <chr>, Value <dbl>,
# Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
# ¹`Domain Code`, ²`Area Code`, ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Fortunately, all of the data came in tidy, meaning:
However, while every column was nominally tidy-compliant, a number of them were extraneous, in that they provided redundant or useless data. These columns, specifically, were “Domain,” “Domain Code,” “Element,” “Element Code,” “Unit,” and “Year Code.”
“Year Code” was redundant, in that its values were equal to “Year” for every observation. Of course, attempting to verify this by hand would be borderline impossible, but luckily this was easily testable by some quick code:
# A tibble: 1 × 2
year_test n
<dbl> <int>
1 1 30977
Because the value of value of the dummy “year_test” variable is equal to 1 for all observations, we can know that “Year Code” was equal to “Year” for all observations, and thus eliminate “Year Code” without losing any information.
In the case of the other five columns named above, all observations contained the same value, making the columns practically useless. Once again, this assertion was easily testable:
# A tibble: 1 × 2
Domain n
<chr> <int>
1 Live Animals 30977
# A tibble: 1 × 2
`Domain Code` n
<chr> <int>
1 QA 30977
# A tibble: 1 × 2
Element n
<chr> <int>
1 Stocks 30977
# A tibble: 1 × 2
`Element Code` n
<dbl> <int>
1 5112 30977
# A tibble: 1 × 2
Unit n
<chr> <int>
1 1000 Head 30977
Thus, with each column only carrying one value throughout the data set, these columns could be deleted without pratically losing information. In this case, however, the values being removed from the dataframe are not already being kept somewhere else in the dataframe, as opposed to the duplication found in the Year/Year Code case. Thus, recording these values in a separate location may be desirable, depending upon the exact nature of the dataset and the desired analysis.
Beyond removing these extraneous columns, the only other adjustments I found necessary were to rename the remaining columns to abide by “snake_case.” This was done for practicality (some of the column names had spaces), consistency, and personal preference.
All of these changes, then, are seen here:
birds<-birds %>%
select(-starts_with("Domain")) %>% ## All values identical for all obs
rename(area_code = `Area Code`) %>%
rename(area = Area) %>%
select(-starts_with("Element")) %>% ## All values identical for all obs
rename(item_code = `Item Code`) %>%
rename(item = Item) %>%
select(-`Year Code`) %>% ## Values identical to Year for all obs
rename(year = Year) %>%
select(-Unit) %>% ## All values identical for all obs
rename(value = Value) %>%
rename(flag = Flag) %>%
rename(flag_desc = `Flag Description`)
birds
# A tibble: 30,977 × 8
area_code area item_code item year value flag flag_desc
<dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr>
1 2 Afghanistan 1057 Chickens 1961 4700 F FAO estimate
2 2 Afghanistan 1057 Chickens 1962 4900 F FAO estimate
3 2 Afghanistan 1057 Chickens 1963 5000 F FAO estimate
4 2 Afghanistan 1057 Chickens 1964 5300 F FAO estimate
5 2 Afghanistan 1057 Chickens 1965 5500 F FAO estimate
6 2 Afghanistan 1057 Chickens 1966 5800 F FAO estimate
7 2 Afghanistan 1057 Chickens 1967 6600 F FAO estimate
8 2 Afghanistan 1057 Chickens 1968 6290 <NA> Official data
9 2 Afghanistan 1057 Chickens 1969 6300 F FAO estimate
10 2 Afghanistan 1057 Chickens 1970 6000 F FAO estimate
# … with 30,967 more rows
# ℹ Use `print(n = ...)` to see more rows
These data appear to be recording the populations of various types of birds across a number of countries and years.
[1] "area_code" "area" "item_code" "item" "year" "value"
[7] "flag" "flag_desc"
# A tibble: 5 × 2
item `Median Values by Type`
<chr> <dbl>
1 Chickens 10784.
2 Ducks 510
3 Geese and guinea fowls 258
4 Pigeons, other birds 2800
5 Turkeys 528
# A tibble: 58 × 2
year `Median Values by Year`
<dbl> <dbl>
1 1961 1033
2 1962 1014
3 1963 1106
4 1964 1103
5 1965 1104
6 1966 1088.
7 1967 1193
8 1968 1252.
9 1969 1267
10 1970 1259
# … with 48 more rows
# ℹ Use `print(n = ...)` to see more rows
# A tibble: 248 × 2
area `Median Values by Area`
<chr> <dbl>
1 Afghanistan 6700
2 Africa 12910.
3 Albania 1300
4 Algeria 42.5
5 American Samoa 38
6 Americas 66924.
7 Angola 6075
8 Antigua and Barbuda 85
9 Argentina 2355
10 Armenia 1528.
# … with 238 more rows
# ℹ Use `print(n = ...)` to see more rows
---
title: "Nick Boonstra Challenge 1 Resubmit"
author: "Nick Boonstra"
desription: "Reading in data and creating a post"
date: "08/16/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_1
- boonstra
- week_1
- birds
---
## Introduction
This challenge involves reading in and cleaning up data from the "birds" data set.
#### Setup chunk
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Reading In and Cleaning Up
```{r}
#| label: read in
birds<-read_csv("_data/birds.csv")
birds
```
Fortunately, all of the data came in tidy, meaning:
- Each column measured observations of only one variable;
- Each row provided values for only one observation; and,
- Each cell contained only one value.
### Data Cleanup
However, while every column was nominally tidy-compliant, a number of them were extraneous, in that they provided redundant or useless data. These columns, specifically, were "Domain," "Domain Code," "Element," "Element Code," "Unit," and "Year Code."
"Year Code" was redundant, in that its values were equal to "Year" for every observation. Of course, attempting to verify this by hand would be borderline impossible, but luckily this was easily testable by some quick code:
```{r}
#| label: year code test
birds_test <- read_csv("_data/birds.csv")
birds_test <- birds_test %>%
mutate(year_test = case_when(
Year == `Year Code` ~ 1,
TRUE ~ 0
))
count(birds_test,year_test)
rm(birds_test)
```
Because the value of value of the dummy "year_test" variable is equal to 1 for all observations, we can know that "Year Code" was equal to "Year" for all observations, and thus eliminate "Year Code" without losing any information.
In the case of the other five columns named above, all observations contained the same value, making the columns practically useless. Once again, this assertion was easily testable:
```{r}
#| label: single value vars
count(birds,Domain)
count(birds,`Domain Code`)
count(birds,Element)
count(birds,`Element Code`)
count(birds,Unit)
```
Thus, with each column only carrying one value throughout the data set, these columns could be deleted without pratically losing information. In this case, however, the values being removed from the dataframe are not already being kept somewhere else in the dataframe, as opposed to the duplication found in the Year/Year Code case. Thus, recording these values in a separate location may be desirable, depending upon the exact nature of the dataset and the desired analysis.
Beyond removing these extraneous columns, the only other adjustments I found necessary were to rename the remaining columns to abide by "snake_case." This was done for practicality (some of the column names had spaces), consistency, and personal preference.
All of these changes, then, are seen here:
```{r}
#| label: cleanup
birds<-birds %>%
select(-starts_with("Domain")) %>% ## All values identical for all obs
rename(area_code = `Area Code`) %>%
rename(area = Area) %>%
select(-starts_with("Element")) %>% ## All values identical for all obs
rename(item_code = `Item Code`) %>%
rename(item = Item) %>%
select(-`Year Code`) %>% ## Values identical to Year for all obs
rename(year = Year) %>%
select(-Unit) %>% ## All values identical for all obs
rename(value = Value) %>%
rename(flag = Flag) %>%
rename(flag_desc = `Flag Description`)
birds
```
## Describing the data
These data appear to be recording the populations of various types of birds across a number of countries and years.
```{r}
#| label: summary
names(birds)
birds %>%
group_by(item) %>%
summarise("Median Values by Type" = median(value,na.rm=T))
birds %>%
group_by(year) %>%
summarise("Median Values by Year" = median(value,na.rm=T))
birds %>%
group_by(area) %>%
summarise("Median Values by Area" = median(value,na.rm=T))
```