Code
library(tidyverse)
library(here)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Daniel Manning
December 28, 2022
Today’s challenge is to
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
# A tibble: 82,116 × 14
Domain Cod…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1961 1961
2 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1962 1962
3 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1963 1963
4 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1964 1964
5 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1965 1965
6 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1966 1966
7 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1967 1967
8 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1968 1968
9 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1969 1969
10 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1970 1970
# … with 82,106 more rows, 4 more variables: Unit <chr>, Value <dbl>,
# Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
# ¹`Domain Code`, ²`Area Code`, ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
In the above cell, I used read.csv() to store the FAOSTAT dataset within my livestock variable. Based on the output when calling the variable livestock, the variable types are as follows:
Domain.Code: String Domain: String Area.Code:Integer Area:String Element.Code: Integer Element:String Item.Code: Integer Item: String Year.Code: Integer Year: Integer Unit: String Value: Integer Flag: String Flag.Description String
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
First I used the head() function to determine that the dataset has fourteen variables. This dataset was likely gathered by survey. Then I used dim() to show that their are 82116 rows, representing the total number of values of livestock counts/estimates spanning various countries (Areas), Types (Items), and years (Years). Next I stored the columns into individual variables for ease of use later on. I used the n_distinct() and distinct() functions to show the number of unique entries in each column and to ouput these values.
# A tibble: 6 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1961 1961 Head
2 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1962 1962 Head
3 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1963 1963 Head
4 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1964 1964 Head
5 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1965 1965 Head
6 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1966 1966 Head
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
[1] 82116 14
[1] 1
# A tibble: 1 × 1
Domain
<chr>
1 Live Animals
[1] 253
# A tibble: 253 × 1
Area
<chr>
1 Afghanistan
2 Albania
3 Algeria
4 American Samoa
5 Angola
6 Antigua and Barbuda
7 Argentina
8 Armenia
9 Aruba
10 Australia
# … with 243 more rows
[1] 1
# A tibble: 1 × 1
Element
<chr>
1 Stocks
[1] 9
# A tibble: 9 × 1
Item
<chr>
1 Asses
2 Camels
3 Cattle
4 Goats
5 Horses
6 Mules
7 Sheep
8 Buffaloes
9 Pigs
[1] 58
# A tibble: 58 × 1
Year
<dbl>
1 1961
2 1962
3 1963
4 1964
5 1965
6 1966
7 1967
8 1968
9 1969
10 1970
# … with 48 more rows
[1] 1
# A tibble: 1 × 1
Unit
<chr>
1 Head
Conduct some exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
I first used the select() function to remove any columns/variables that were redundant or contained only 1 distinct value. I then used the head() to show the first 10 rows our new dataset to confirm that the desired variables were removed. Next, I stored the value (number of a given livestock) in a new variable. I calculated summary statistics of Value, and included the ‘na.rm = TRUE’ parameter so NA values were ignored. Following this, I created a new variable (Area) using the group_by() function. Using the summarise function, I then computed the mean, median, standard deviation, and IQR for the number of livestock, grouped by country. I calculated the summary statistics of the the data grouped by type of animal (Item) and year using the same method.
Error in `select()`:
! Can't subset columns that don't exist.
✖ Column `Domain.Code` doesn't exist.
Error in head(livestock_new): object 'livestock_new' not found
Error in eval(expr, envir, enclos): object 'livestock_new' not found
[1] 11625569
[1] 224667
[1] 0
[1] 1489744504
[1] 64779790
[1] 4.196421e+15
[1] 2364200
Error in group_by(., Area): object 'livestock_new' not found
Error in UseMethod("summarise"): no applicable method for 'summarise' applied to an object of class "character"
Error in group_by(., Item): object 'livestock_new' not found
Error in UseMethod("summarise"): no applicable method for 'summarise' applied to an object of class "character"
Error in group_by(., Year): object 'livestock_new' not found
Error in UseMethod("summarise"): no applicable method for 'summarise' applied to an object of class "c('double', 'numeric')"
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
I chose to group the data by Area, Unit, and Year in order to see how the values differed across these observations. I noticed that the mean Values differed by country, year, and animal type. One interesting observation was that greatest average Value was the number of Cattle while the smallest average Value was mules. In addition, the total Value tended to increase every year.
---
title: "Challenge 2"
author: "Daniel Manning"
description: "Data wrangling: using group() and summarise()"
date: "12/28/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
library(here)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xlsx ⭐
- FAOstat\*.csv ⭐⭐⭐
- hotel_bookings ⭐⭐⭐⭐
```{r}
livestock <- here("posts","_data","FAOSTAT_livestock.csv")%>%
read_csv()
livestock
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
In the above cell, I used read.csv() to store the FAOSTAT dataset within my livestock variable. Based on the output when calling the variable livestock, the variable types are as follows:
Domain.Code: String
Domain: String
Area.Code:Integer
Area:String
Element.Code: Integer
Element:String
Item.Code: Integer
Item: String
Year.Code: Integer
Year: Integer
Unit: String
Value: Integer
Flag: String
Flag.Description String
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
First I used the head() function to determine that the dataset has fourteen variables. This dataset was likely gathered by survey. Then I used dim() to show that their are 82116 rows, representing the total number of values of livestock counts/estimates spanning various countries (Areas), Types (Items), and years (Years). Next I stored the columns into individual variables for ease of use later on. I used the n_distinct() and distinct() functions to show the number of unique entries in each column and to ouput these values.
```{r}
#| label: summary
head(livestock)
dim(livestock)
Domain <- livestock$Domain
Area <- livestock$Area
Element <- livestock$Element
Item <- livestock$Item
Year <- livestock$Year
Unit <- livestock$Unit
Value <- livestock$Value
livestock %>%
select(Domain) %>%
n_distinct(.)
livestock %>%
select(Domain) %>%
distinct()
livestock %>%
select(Area) %>%
n_distinct(.)
livestock %>%
select(Area) %>%
distinct()
livestock %>%
select(Element) %>%
n_distinct(.)
livestock %>%
select(Element) %>%
distinct()
livestock %>%
select(Item) %>%
n_distinct(.)
livestock %>%
select(Item) %>%
distinct()
livestock %>%
select(Year) %>%
n_distinct(.)
livestock %>%
select(Year) %>%
distinct()
livestock %>%
select(Unit) %>%
n_distinct(.)
livestock %>%
select(Unit) %>%
distinct()
```
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
I first used the select() function to remove any columns/variables that were redundant or contained only 1 distinct value. I then used the head() to show the first 10 rows our new dataset to confirm that the desired variables were removed. Next, I stored the value (number of a given livestock) in a new variable. I calculated summary statistics of Value, and included the 'na.rm = TRUE' parameter so NA values were ignored. Following this, I created a new variable (Area) using the group_by() function. Using the summarise function, I then computed the mean, median, standard deviation, and IQR for the number of livestock, grouped by country. I calculated the summary statistics of the the data grouped by type of animal (Item) and year using the same method.
```{r}
livestock_new <- livestock %>% select(-c('Domain', 'Domain.Code', 'Area.Code', 'Element', 'Element.Code', 'Item.Code', 'Year.Code', 'Unit', 'Flag'))
head(livestock_new)
Value <- livestock_new$Value
mean(Value, na.rm = TRUE)
median(Value, na.rm = TRUE)
min(Value, na.rm = TRUE)
max(Value, na.rm = TRUE)
sd(Value, na.rm = TRUE)
var(Value, na.rm = TRUE)
IQR(Value, na.rm = TRUE)
Area <- livestock_new %>% group_by(Area)
Area %>% summarise(Value_mean = mean(Value, na.rm = TRUE), Value_median = median(Value, na.rm = TRUE), Value_sd = sd(Value, na.rm = TRUE), Value_IQR = IQR(Value, na.rm = TRUE))
Item <- livestock_new %>% group_by(Item)
Item %>% summarise(Value_mean = mean(Value, na.rm = TRUE), Value_median = median(Value, na.rm = TRUE), Value_sd = sd(Value, na.rm = TRUE), Value_IQR = IQR(Value, na.rm = TRUE))
Year <- livestock_new %>% group_by(Year)
Year %>% summarise(Value_mean = mean(Value, na.rm = TRUE), Value_median = median(Value, na.rm = TRUE), Value_sd = sd(Value, na.rm = TRUE), Value_IQR = IQR(Value, na.rm = TRUE))
```
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
I chose to group the data by Area, Unit, and Year in order to see how the values differed across these observations. I noticed that the mean Values differed by country, year, and animal type. One interesting observation was that greatest average Value was the number of Cattle while the smallest average Value was mules. In addition, the total Value tended to increase every year.