DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Sarah McAlpine - Challenge 2

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Challenge Overview
  • Reading in the Data
  • Refining Distinct Values
  • Rearranging the Data
  • Reading and Tidying Together
  • Summary Statistics
  • Summary

Sarah McAlpine - Challenge 2

  • Show All Code
  • Hide All Code

  • View Source
challenge_2
sarahmcalpine
cattle
Author

Sarah McAlpine

Published

September 20, 2022

Code
library(tidyverse)
library(tidyr)
library(summarytools)
library(readr)

knitr::opts_chunk$set(echo = TRUE)

Challenge Overview

Today’s challenge is to read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc.) Next, provide summary statistics for different interesting groups within the data, and interpret those statistics.

Reading in the Data

I will read in the FAO cattle data for my first attempt and may return to this to do another.

Code
#read in the data. Use read_csv over read.csv
cattle<- read_csv("_data/FAOSTAT_cattle_dairy.csv")
Rows: 36449 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
colnames(cattle)
 [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
 [5] "Element Code"     "Element"          "Item Code"        "Item"            
 [9] "Year Code"        "Year"             "Unit"             "Value"           
[13] "Flag"             "Flag Description"
Code
#preview data without the 100% values
head(select(cattle, Area, Element, Year, Unit, Value))%>%
  group_by(Area) %>%
slice(1:10)
# A tibble: 6 × 5
# Groups:   Area [1]
  Area        Element       Year Unit    Value
  <chr>       <chr>        <dbl> <chr>   <dbl>
1 Afghanistan Milk Animals  1961 Head   700000
2 Afghanistan Yield         1961 hg/An    5000
3 Afghanistan Production    1961 tonnes 350000
4 Afghanistan Milk Animals  1962 Head   700000
5 Afghanistan Yield         1962 hg/An    5000
6 Afghanistan Production    1962 tonnes 350000
Code
# sort out the 3 different Elements into their own columns?

Refining Distinct Values

Next I will print a summary with only the columns that have multiple distinct values.

Code
cattle.sm <- select(cattle, Area, Element, Year, Unit, Value)

# preview with data frame summary
print (dfSummary (cattle.sm, varnumbers = FALSE,
                plain.ascii = FALSE,
                style = "grid" ),
      method = 'render', 
      table.classes = 'table-condensed')

Data Frame Summary

cattle.sm

Dimensions: 36449 x 5
Duplicates: 0
Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
Area [character]
1. Afghanistan
2. Africa
3. Albania
4. Algeria
5. American Samoa
6. Americas
7. Angola
8. Antigua and Barbuda
9. Argentina
10. Asia
[ 222 others ]
174(0.5%)
174(0.5%)
174(0.5%)
174(0.5%)
174(0.5%)
174(0.5%)
174(0.5%)
174(0.5%)
174(0.5%)
174(0.5%)
34709(95.2%)
36449 (100.0%) 0 (0.0%)
Element [character]
1. Milk Animals
2. Production
3. Yield
12158(33.4%)
12170(33.4%)
12121(33.3%)
36449 (100.0%) 0 (0.0%)
Year [numeric]
Mean (sd) : 1990.4 (16.8)
min ≤ med ≤ max:
1961 ≤ 1991 ≤ 2018
IQR (CV) : 29 (0)
58 distinct values 36449 (100.0%) 0 (0.0%)
Unit [character]
1. Head
2. hg/An
3. tonnes
12158(33.4%)
12121(33.3%)
12170(33.4%)
36449 (100.0%) 0 (0.0%)
Value [numeric]
Mean (sd) : 4410235 (25744621)
min ≤ med ≤ max:
7 ≤ 43266 ≤ 683217055
IQR (CV) : 692151 (5.8)
24088 distinct values 36375 (99.8%) 74 (0.2%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

Rearranging the Data

I see that there are 3 “Elements”: Milk Animals, Production, and Yield; and 3 “Units” and I suspect these are aligned where each element is measured in its own unit. In order to get a better view of this data and ensure that each column contains only one factor, I will move those into their own columns with `pivot_wider’. See my attempt below.

Code
# Move the units into their own columns, pulling their values along. 

#use read_csv over read.csv
pivot_wider(cattle.sm, id_cols = c(Area, Year), 
            names_from = Unit, 
            values_from = Value)%>%
  # Rename columns to describe the moved variables
  rename(Head_Cattle = Head, Animal_hg = "hg/An", Tonnes_Milk = tonnes)%>%
 #Stop showing the elements, now that they are tied to the units in the new column names
   select(Area, Year, Head_Cattle, Animal_hg, Tonnes_Milk) %>%
  #attempt to remove the NAs and collapse the rows together. Didn't work--need to try again next time.
  head()
# A tibble: 6 × 5
  Area         Year Head_Cattle Animal_hg Tonnes_Milk
  <chr>       <dbl>       <dbl>     <dbl>       <dbl>
1 Afghanistan  1961      700000      5000      350000
2 Afghanistan  1962      700000      5000      350000
3 Afghanistan  1963      780000      5128      400000
4 Afghanistan  1964      780000      5128      400000
5 Afghanistan  1965      870000      5172      450000
6 Afghanistan  1966      960000      5208      500000
Code
 #would I use case_when for collapsing the rows together? 
#How do I assign a new name to this format of the data? Where can I put that command, near the beginning?

Reading and Tidying Together

Now that I have explored the data, I’m ready to set up my read-in with all the tidying all at once so that I can work with a single data set for my analysis. Below is my script to do so. Since all the steps are piped in one script, the assignment of “clean.dairy” applies to the new, tidy data set.

Code
library(tidyverse)
clean.dairy <- read_csv("_data/FAOSTAT_cattle_dairy.csv") %>%
  filter(Flag!="A")%>%

  select(Area, Year, Unit, Value) %>%

  pivot_wider(names_from = Unit, 
            values_from = Value) %>%
  # Rename columns to describe the moved variables
  rename(Head_Cattle = Head, hg_Animal = "hg/An", Tonnes_Milk = tonnes)
Rows: 36449 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
#preview to confirm
head(clean.dairy)
# A tibble: 6 × 5
  Area         Year Head_Cattle hg_Animal Tonnes_Milk
  <chr>       <dbl>       <dbl>     <dbl>       <dbl>
1 Afghanistan  1961      700000      5000      350000
2 Afghanistan  1962      700000      5000      350000
3 Afghanistan  1963      780000      5128      400000
4 Afghanistan  1964      780000      5128      400000
5 Afghanistan  1965      870000      5172      450000
6 Afghanistan  1966      960000      5208      500000

Summary Statistics

As shown below, average (mean) values of these three variables ??what do I observe??

Code
  #the following works for a table of min, max, avg for Head_Cattle, hg_Animal, and Tonnes_Milk.
clean.dairy %>%
  select(clean.dairy, Area = China)
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(clean.dairy)

  # Now:
  data %>% select(all_of(clean.dairy))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
Error in `select()`:
! Can't subset columns with `clean.dairy`.
✖ `clean.dairy` must be numeric or character, not a <tbl_df/tbl/data.frame> object.
Code
   summarize(avg_Head_Cattle = mean (Head_Cattle, na.rm=TRUE),
          med_Head_Cattle = median (Head_Cattle, na.rm=TRUE), 
          max_Head_Cattle = max (Head_Cattle, na.rm=TRUE),
          min_Head_Cattle = min(Head_Cattle, na.rm=TRUE))
Error in mean(Head_Cattle, na.rm = TRUE): object 'Head_Cattle' not found
Code
 clean.dairy %>%
  summarize(avg_hg_Animal = mean (hg_Animal, na.rm=TRUE),
          med_hg_Animal = median (hg_Animal, na.rm=TRUE), 
          max_hg_Animal = max (hg_Animal, na.rm=TRUE),
          min_hg_Animal = min(hg_Animal, na.rm=TRUE))   
# A tibble: 1 × 4
  avg_hg_Animal med_hg_Animal max_hg_Animal min_hg_Animal
          <dbl>         <dbl>         <dbl>         <dbl>
1        19329.         13218        134121           923
Code
 clean.dairy %>%

  summarize(avg_Tonnes_Milk = mean (Tonnes_Milk, na.rm=TRUE),
          med_Tonnes_Milk = median (Tonnes_Milk, na.rm=TRUE), 
          max_Tonnes_Milk = max (Tonnes_Milk, na.rm=TRUE),
          min_Tonnes_Milk = min(Tonnes_Milk, na.rm=TRUE))  
# A tibble: 1 × 4
  avg_Tonnes_Milk med_Tonnes_Milk max_Tonnes_Milk min_Tonnes_Milk
            <dbl>           <dbl>           <dbl>           <dbl>
1         412549.           28014        25270000               7
Code
 #Counts of unique occurences
 distinct(clean.dairy, Year)
# A tibble: 58 × 1
    Year
   <dbl>
 1  1961
 2  1962
 3  1963
 4  1964
 5  1965
 6  1966
 7  1967
 8  1968
 9  1969
10  1970
# … with 48 more rows
Code
  distinct(clean.dairy, Area)
# A tibble: 232 × 1
   Area               
   <chr>              
 1 Afghanistan        
 2 Albania            
 3 Algeria            
 4 American Samoa     
 5 Angola             
 6 Antigua and Barbuda
 7 Argentina          
 8 Armenia            
 9 Australia          
10 Austria            
# … with 222 more rows

Summary

The FAOSTAT data on cattle contains 58 years of data from 1961 to 2018 on cattle meat and milk production for 204 countries (232 minus 28 “aggregate” regions).

For my analysis, I am interested in the top producers of meat and milk. I have relocated the “Items” of production to their own columns for a tidy set of data. From there, I can observe the max, min, and averages of heads of cattle, meat production, and milk production. The top 10 producers of Meat in hectograms is shown below, and of milk in tonnes below that. Interestingly, Canada is the only country that appears in both lists.

As a side note, I am not sure why Northern America appears in the top beef producers table, since I removed aggregate values when I read in the data. Perhaps there was an error there.

In the future, it may be interesting to compare rates of meat and milk production in proportion to the head of cattle. Another question to investigate could be comparing countries’ production vs consumption of cattle products.

Code
# Top 10 Beef Producers in a single year
clean.dairy %>%
select(Area, hg_Animal)%>%
   group_by(Area) %>%
  arrange(desc(hg_Animal)) %>%
            slice(1)%>%
  ungroup()%>%
  arrange(desc(hg_Animal)) %>%
  slice(1:10)
# A tibble: 10 × 2
   Area                     hg_Animal
   <chr>                        <dbl>
 1 Israel                      134121
 2 Republic of Korea           106757
 3 United States of America    104633
 4 Northern America            102840
 5 Saudi Arabia                100400
 6 Denmark                      98952
 7 Spain                        96660
 8 Estonia                      93615
 9 Canada                       89357
10 Finland                      88343
Code
# Top 10 Milk Producers in a single year
clean.dairy %>%
select(Area, Tonnes_Milk)%>%
   group_by(Area) %>%
  arrange(desc(Tonnes_Milk)) %>%
            slice(1)%>%
  ungroup()%>%
  arrange(desc(Tonnes_Milk)) %>%
  slice(1:10)
# A tibble: 10 × 2
   Area                       Tonnes_Milk
   <chr>                            <dbl>
 1 France                        25270000
 2 India                         21500000
 3 Netherlands                   10634163
 4 Brazil                         9022000
 5 Canada                         8564000
 6 Argentina                      8100000
 7 Turkey                         7909600
 8 Colombia                       7100000
 9 Iran (Islamic Republic of)     6929212
10 Sudan (former)                 5400000
Source Code
---
title: "Sarah McAlpine - Challenge 2"
author: "Sarah McAlpine"
desription: "Reading in Data and Summarizing Its Groups"
date: "09/20/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - sarahmcalpine
  - cattle
---

```{r}
#| label: setup
#| warning: false

library(tidyverse)
library(tidyr)
library(summarytools)
library(readr)

knitr::opts_chunk$set(echo = TRUE)
```

## Challenge Overview

Today's challenge is to read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc.) Next, provide summary statistics for different interesting groups within the data, and interpret those statistics.

## Reading in the Data
I will read in the FAO cattle data for my first attempt and may return to this to do another.


```{r}
#read in the data. Use read_csv over read.csv
cattle<- read_csv("_data/FAOSTAT_cattle_dairy.csv")
colnames(cattle)

#preview data without the 100% values
head(select(cattle, Area, Element, Year, Unit, Value))%>%
  group_by(Area) %>%
slice(1:10)

# sort out the 3 different Elements into their own columns?

```
## Refining Distinct Values
Next I will print a summary with only the columns that have multiple distinct values.
```{r}
cattle.sm <- select(cattle, Area, Element, Year, Unit, Value)

# preview with data frame summary
print (dfSummary (cattle.sm, varnumbers = FALSE,
                plain.ascii = FALSE,
                style = "grid" ),
      method = 'render', 
      table.classes = 'table-condensed')
```

## Rearranging the Data
I see that there are 3 "Elements": Milk Animals, Production, and Yield; and 3 "Units" and I suspect these are aligned where each element is measured in its own unit. In order to get a better view of this data and ensure that each column contains only one factor, I will move those into their own columns with `pivot_wider'. See my attempt below.
```{r}
# Move the units into their own columns, pulling their values along. 

#use read_csv over read.csv
pivot_wider(cattle.sm, id_cols = c(Area, Year), 
            names_from = Unit, 
            values_from = Value)%>%
  # Rename columns to describe the moved variables
  rename(Head_Cattle = Head, Animal_hg = "hg/An", Tonnes_Milk = tonnes)%>%
 #Stop showing the elements, now that they are tied to the units in the new column names
   select(Area, Year, Head_Cattle, Animal_hg, Tonnes_Milk) %>%
  #attempt to remove the NAs and collapse the rows together. Didn't work--need to try again next time.
  head()
 #would I use case_when for collapsing the rows together? 
#How do I assign a new name to this format of the data? Where can I put that command, near the beginning?

```
## Reading and Tidying Together
Now that I have explored the data, I'm ready to set up my read-in with all the tidying all at once so that I can work with a single data set for my analysis. Below is my script to do so. Since all the steps are piped in one script, the assignment of "clean.dairy" applies to the new, tidy data set. 
```{r}
library(tidyverse)
clean.dairy <- read_csv("_data/FAOSTAT_cattle_dairy.csv") %>%
  filter(Flag!="A")%>%

  select(Area, Year, Unit, Value) %>%

  pivot_wider(names_from = Unit, 
            values_from = Value) %>%
  # Rename columns to describe the moved variables
  rename(Head_Cattle = Head, hg_Animal = "hg/An", Tonnes_Milk = tonnes)
#preview to confirm
head(clean.dairy)
```
## Summary Statistics
As shown below, average (mean) values of these three variables ??what do I observe??

```{r}
  #the following works for a table of min, max, avg for Head_Cattle, hg_Animal, and Tonnes_Milk.
clean.dairy %>%
  select(clean.dairy, Area = China)
   summarize(avg_Head_Cattle = mean (Head_Cattle, na.rm=TRUE),
          med_Head_Cattle = median (Head_Cattle, na.rm=TRUE), 
          max_Head_Cattle = max (Head_Cattle, na.rm=TRUE),
          min_Head_Cattle = min(Head_Cattle, na.rm=TRUE))
          
 clean.dairy %>%
  summarize(avg_hg_Animal = mean (hg_Animal, na.rm=TRUE),
          med_hg_Animal = median (hg_Animal, na.rm=TRUE), 
          max_hg_Animal = max (hg_Animal, na.rm=TRUE),
          min_hg_Animal = min(hg_Animal, na.rm=TRUE))   
 
 clean.dairy %>%

  summarize(avg_Tonnes_Milk = mean (Tonnes_Milk, na.rm=TRUE),
          med_Tonnes_Milk = median (Tonnes_Milk, na.rm=TRUE), 
          max_Tonnes_Milk = max (Tonnes_Milk, na.rm=TRUE),
          min_Tonnes_Milk = min(Tonnes_Milk, na.rm=TRUE))  

 #Counts of unique occurences
 distinct(clean.dairy, Year)

  distinct(clean.dairy, Area)

```
## Summary
The FAOSTAT data on cattle contains 58 years of data from 1961 to 2018 on cattle meat and milk production for 204 countries (232 minus 28 "aggregate" regions). 

For my analysis, I am interested in the top producers of meat and milk. I have relocated the "Items" of production to their own columns for a tidy set of data. From there, I can observe the max, min, and averages of heads of cattle, meat production, and milk production. The top 10 producers of Meat in hectograms is shown below, and of milk in tonnes below that. Interestingly, Canada is the only country that appears in both lists. 

As a side note, I am not sure why Northern America appears in the top beef producers table, since I removed aggregate values when I read in the data. Perhaps there was an error there.

In the future, it may be interesting to compare rates of meat and milk production in proportion to the head of cattle. Another question to investigate could be comparing countries' production vs consumption of cattle products.

```{r}
# Top 10 Beef Producers in a single year
clean.dairy %>%
select(Area, hg_Animal)%>%
   group_by(Area) %>%
  arrange(desc(hg_Animal)) %>%
            slice(1)%>%
  ungroup()%>%
  arrange(desc(hg_Animal)) %>%
  slice(1:10)

# Top 10 Milk Producers in a single year
clean.dairy %>%
select(Area, Tonnes_Milk)%>%
   group_by(Area) %>%
  arrange(desc(Tonnes_Milk)) %>%
            slice(1)%>%
  ungroup()%>%
  arrange(desc(Tonnes_Milk)) %>%
  slice(1:10)
```