Code
library(tidyverse)
library(tidyr)
library(summarytools)
library(readr)
::opts_chunk$set(echo = TRUE) knitr
Sarah McAlpine
September 20, 2022
Today’s challenge is to read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc.) Next, provide summary statistics for different interesting groups within the data, and interpret those statistics.
I will read in the FAO cattle data for my first attempt and may return to this to do another.
Rows: 36449 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] "Domain Code" "Domain" "Area Code" "Area"
[5] "Element Code" "Element" "Item Code" "Item"
[9] "Year Code" "Year" "Unit" "Value"
[13] "Flag" "Flag Description"
# A tibble: 6 × 5
# Groups: Area [1]
Area Element Year Unit Value
<chr> <chr> <dbl> <chr> <dbl>
1 Afghanistan Milk Animals 1961 Head 700000
2 Afghanistan Yield 1961 hg/An 5000
3 Afghanistan Production 1961 tonnes 350000
4 Afghanistan Milk Animals 1962 Head 700000
5 Afghanistan Yield 1962 hg/An 5000
6 Afghanistan Production 1962 tonnes 350000
Next I will print a summary with only the columns that have multiple distinct values.
Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Area [character] |
|
|
36449 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Element [character] |
|
|
36449 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Year [numeric] |
|
58 distinct values | 36449 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Unit [character] |
|
|
36449 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Value [numeric] |
|
24088 distinct values | 36375 (99.8%) | 74 (0.2%) |
Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20
I see that there are 3 “Elements”: Milk Animals, Production, and Yield; and 3 “Units” and I suspect these are aligned where each element is measured in its own unit. In order to get a better view of this data and ensure that each column contains only one factor, I will move those into their own columns with `pivot_wider’. See my attempt below.
# Move the units into their own columns, pulling their values along.
#use read_csv over read.csv
pivot_wider(cattle.sm, id_cols = c(Area, Year),
names_from = Unit,
values_from = Value)%>%
# Rename columns to describe the moved variables
rename(Head_Cattle = Head, Animal_hg = "hg/An", Tonnes_Milk = tonnes)%>%
#Stop showing the elements, now that they are tied to the units in the new column names
select(Area, Year, Head_Cattle, Animal_hg, Tonnes_Milk) %>%
#attempt to remove the NAs and collapse the rows together. Didn't work--need to try again next time.
head()
# A tibble: 6 × 5
Area Year Head_Cattle Animal_hg Tonnes_Milk
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1961 700000 5000 350000
2 Afghanistan 1962 700000 5000 350000
3 Afghanistan 1963 780000 5128 400000
4 Afghanistan 1964 780000 5128 400000
5 Afghanistan 1965 870000 5172 450000
6 Afghanistan 1966 960000 5208 500000
Now that I have explored the data, I’m ready to set up my read-in with all the tidying all at once so that I can work with a single data set for my analysis. Below is my script to do so. Since all the steps are piped in one script, the assignment of “clean.dairy” applies to the new, tidy data set.
library(tidyverse)
clean.dairy <- read_csv("_data/FAOSTAT_cattle_dairy.csv") %>%
filter(Flag!="A")%>%
select(Area, Year, Unit, Value) %>%
pivot_wider(names_from = Unit,
values_from = Value) %>%
# Rename columns to describe the moved variables
rename(Head_Cattle = Head, hg_Animal = "hg/An", Tonnes_Milk = tonnes)
Rows: 36449 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 5
Area Year Head_Cattle hg_Animal Tonnes_Milk
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1961 700000 5000 350000
2 Afghanistan 1962 700000 5000 350000
3 Afghanistan 1963 780000 5128 400000
4 Afghanistan 1964 780000 5128 400000
5 Afghanistan 1965 870000 5172 450000
6 Afghanistan 1966 960000 5208 500000
As shown below, average (mean) values of these three variables ??what do I observe??
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(clean.dairy)
# Now:
data %>% select(all_of(clean.dairy))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
Error in `select()`:
! Can't subset columns with `clean.dairy`.
✖ `clean.dairy` must be numeric or character, not a <tbl_df/tbl/data.frame> object.
Error in mean(Head_Cattle, na.rm = TRUE): object 'Head_Cattle' not found
# A tibble: 1 × 4
avg_hg_Animal med_hg_Animal max_hg_Animal min_hg_Animal
<dbl> <dbl> <dbl> <dbl>
1 19329. 13218 134121 923
# A tibble: 1 × 4
avg_Tonnes_Milk med_Tonnes_Milk max_Tonnes_Milk min_Tonnes_Milk
<dbl> <dbl> <dbl> <dbl>
1 412549. 28014 25270000 7
# A tibble: 58 × 1
Year
<dbl>
1 1961
2 1962
3 1963
4 1964
5 1965
6 1966
7 1967
8 1968
9 1969
10 1970
# … with 48 more rows
# A tibble: 232 × 1
Area
<chr>
1 Afghanistan
2 Albania
3 Algeria
4 American Samoa
5 Angola
6 Antigua and Barbuda
7 Argentina
8 Armenia
9 Australia
10 Austria
# … with 222 more rows
The FAOSTAT data on cattle contains 58 years of data from 1961 to 2018 on cattle meat and milk production for 204 countries (232 minus 28 “aggregate” regions).
For my analysis, I am interested in the top producers of meat and milk. I have relocated the “Items” of production to their own columns for a tidy set of data. From there, I can observe the max, min, and averages of heads of cattle, meat production, and milk production. The top 10 producers of Meat in hectograms is shown below, and of milk in tonnes below that. Interestingly, Canada is the only country that appears in both lists.
As a side note, I am not sure why Northern America appears in the top beef producers table, since I removed aggregate values when I read in the data. Perhaps there was an error there.
In the future, it may be interesting to compare rates of meat and milk production in proportion to the head of cattle. Another question to investigate could be comparing countries’ production vs consumption of cattle products.
# A tibble: 10 × 2
Area hg_Animal
<chr> <dbl>
1 Israel 134121
2 Republic of Korea 106757
3 United States of America 104633
4 Northern America 102840
5 Saudi Arabia 100400
6 Denmark 98952
7 Spain 96660
8 Estonia 93615
9 Canada 89357
10 Finland 88343
# A tibble: 10 × 2
Area Tonnes_Milk
<chr> <dbl>
1 France 25270000
2 India 21500000
3 Netherlands 10634163
4 Brazil 9022000
5 Canada 8564000
6 Argentina 8100000
7 Turkey 7909600
8 Colombia 7100000
9 Iran (Islamic Republic of) 6929212
10 Sudan (former) 5400000
---
title: "Sarah McAlpine - Challenge 2"
author: "Sarah McAlpine"
desription: "Reading in Data and Summarizing Its Groups"
date: "09/20/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- sarahmcalpine
- cattle
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(tidyr)
library(summarytools)
library(readr)
knitr::opts_chunk$set(echo = TRUE)
```
## Challenge Overview
Today's challenge is to read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc.) Next, provide summary statistics for different interesting groups within the data, and interpret those statistics.
## Reading in the Data
I will read in the FAO cattle data for my first attempt and may return to this to do another.
```{r}
#read in the data. Use read_csv over read.csv
cattle<- read_csv("_data/FAOSTAT_cattle_dairy.csv")
colnames(cattle)
#preview data without the 100% values
head(select(cattle, Area, Element, Year, Unit, Value))%>%
group_by(Area) %>%
slice(1:10)
# sort out the 3 different Elements into their own columns?
```
## Refining Distinct Values
Next I will print a summary with only the columns that have multiple distinct values.
```{r}
cattle.sm <- select(cattle, Area, Element, Year, Unit, Value)
# preview with data frame summary
print (dfSummary (cattle.sm, varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid" ),
method = 'render',
table.classes = 'table-condensed')
```
## Rearranging the Data
I see that there are 3 "Elements": Milk Animals, Production, and Yield; and 3 "Units" and I suspect these are aligned where each element is measured in its own unit. In order to get a better view of this data and ensure that each column contains only one factor, I will move those into their own columns with `pivot_wider'. See my attempt below.
```{r}
# Move the units into their own columns, pulling their values along.
#use read_csv over read.csv
pivot_wider(cattle.sm, id_cols = c(Area, Year),
names_from = Unit,
values_from = Value)%>%
# Rename columns to describe the moved variables
rename(Head_Cattle = Head, Animal_hg = "hg/An", Tonnes_Milk = tonnes)%>%
#Stop showing the elements, now that they are tied to the units in the new column names
select(Area, Year, Head_Cattle, Animal_hg, Tonnes_Milk) %>%
#attempt to remove the NAs and collapse the rows together. Didn't work--need to try again next time.
head()
#would I use case_when for collapsing the rows together?
#How do I assign a new name to this format of the data? Where can I put that command, near the beginning?
```
## Reading and Tidying Together
Now that I have explored the data, I'm ready to set up my read-in with all the tidying all at once so that I can work with a single data set for my analysis. Below is my script to do so. Since all the steps are piped in one script, the assignment of "clean.dairy" applies to the new, tidy data set.
```{r}
library(tidyverse)
clean.dairy <- read_csv("_data/FAOSTAT_cattle_dairy.csv") %>%
filter(Flag!="A")%>%
select(Area, Year, Unit, Value) %>%
pivot_wider(names_from = Unit,
values_from = Value) %>%
# Rename columns to describe the moved variables
rename(Head_Cattle = Head, hg_Animal = "hg/An", Tonnes_Milk = tonnes)
#preview to confirm
head(clean.dairy)
```
## Summary Statistics
As shown below, average (mean) values of these three variables ??what do I observe??
```{r}
#the following works for a table of min, max, avg for Head_Cattle, hg_Animal, and Tonnes_Milk.
clean.dairy %>%
select(clean.dairy, Area = China)
summarize(avg_Head_Cattle = mean (Head_Cattle, na.rm=TRUE),
med_Head_Cattle = median (Head_Cattle, na.rm=TRUE),
max_Head_Cattle = max (Head_Cattle, na.rm=TRUE),
min_Head_Cattle = min(Head_Cattle, na.rm=TRUE))
clean.dairy %>%
summarize(avg_hg_Animal = mean (hg_Animal, na.rm=TRUE),
med_hg_Animal = median (hg_Animal, na.rm=TRUE),
max_hg_Animal = max (hg_Animal, na.rm=TRUE),
min_hg_Animal = min(hg_Animal, na.rm=TRUE))
clean.dairy %>%
summarize(avg_Tonnes_Milk = mean (Tonnes_Milk, na.rm=TRUE),
med_Tonnes_Milk = median (Tonnes_Milk, na.rm=TRUE),
max_Tonnes_Milk = max (Tonnes_Milk, na.rm=TRUE),
min_Tonnes_Milk = min(Tonnes_Milk, na.rm=TRUE))
#Counts of unique occurences
distinct(clean.dairy, Year)
distinct(clean.dairy, Area)
```
## Summary
The FAOSTAT data on cattle contains 58 years of data from 1961 to 2018 on cattle meat and milk production for 204 countries (232 minus 28 "aggregate" regions).
For my analysis, I am interested in the top producers of meat and milk. I have relocated the "Items" of production to their own columns for a tidy set of data. From there, I can observe the max, min, and averages of heads of cattle, meat production, and milk production. The top 10 producers of Meat in hectograms is shown below, and of milk in tonnes below that. Interestingly, Canada is the only country that appears in both lists.
As a side note, I am not sure why Northern America appears in the top beef producers table, since I removed aggregate values when I read in the data. Perhaps there was an error there.
In the future, it may be interesting to compare rates of meat and milk production in proportion to the head of cattle. Another question to investigate could be comparing countries' production vs consumption of cattle products.
```{r}
# Top 10 Beef Producers in a single year
clean.dairy %>%
select(Area, hg_Animal)%>%
group_by(Area) %>%
arrange(desc(hg_Animal)) %>%
slice(1)%>%
ungroup()%>%
arrange(desc(hg_Animal)) %>%
slice(1:10)
# Top 10 Milk Producers in a single year
clean.dairy %>%
select(Area, Tonnes_Milk)%>%
group_by(Area) %>%
arrange(desc(Tonnes_Milk)) %>%
slice(1)%>%
ungroup()%>%
arrange(desc(Tonnes_Milk)) %>%
slice(1:10)
```