library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Meredith Rolfe
August 16, 2022
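Today’s challenge is to read in a data set and describe it using both words and supporting information (e.g., tables), then provide and interpret summary statistics for different interesting groups within the data.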
Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.
# A tibble: 30,977 × 14
Domain Cod…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1961 1961
2 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1962 1962
3 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1963 1963
4 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1964 1964
5 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1965 1965
6 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1966 1966
7 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1967 1967
8 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1968 1968
9 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1969 1969
10 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1970 1970
# … with 30,967 more rows, 4 more variables: Unit <chr>, Value <dbl>,
# Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
# ¹`Domain Code`, ²`Area Code`, ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
# A tibble: 6 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1961 1961 1000…
2 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1962 1962 1000…
3 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1963 1963 1000…
4 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1964 1964 1000…
5 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1965 1965 1000…
6 QA Live … 2 Afgh… 5112 Stocks 1057 Chic… 1966 1966 1000…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
Domain Code Domain Area Code Area
Length:30977 Length:30977 Min. : 1 Length:30977
Class :character Class :character 1st Qu.: 79 Class :character
Mode :character Mode :character Median : 156 Mode :character
Mean :1202
3rd Qu.: 231
Max. :5504
Element Code Element Item Code Item
Min. :5112 Length:30977 Min. :1057 Length:30977
1st Qu.:5112 Class :character 1st Qu.:1057 Class :character
Median :5112 Mode :character Median :1068 Mode :character
Mean :5112 Mean :1066
3rd Qu.:5112 3rd Qu.:1072
Max. :5112 Max. :1083
Year Code Year Unit Value
Min. :1961 Min. :1961 Length:30977 Min. : 0
1st Qu.:1976 1st Qu.:1976 Class :character 1st Qu.: 171
Median :1992 Median :1992 Mode :character Median : 1800
Mean :1991 Mean :1991 Mean : 99411
3rd Qu.:2005 3rd Qu.:2005 3rd Qu.: 15404
Max. :2018 Max. :2018 Max. :23707134
NA's :1036
Flag Flag Description
Length:30977 Length:30977
Class :character Class :character
Mode :character Mode :character
[1] 30977 14
[1] "Domain Code" "Domain" "Area Code" "Area"
[5] "Element Code" "Element" "Item Code" "Item"
[9] "Year Code" "Year" "Unit" "Value"
[13] "Flag" "Flag Description"
[1] 30977
5
248
Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
[1] 248
The above shows the number of distinct areas
[1] 58
The above shows the number of distinct years
[1] "Data grouped by year:"
Data filtered to include only chickens:
Error in summarise(., mean_stocks = mean(Value, na.rm = TRUE), median_stocks = median(Value, : object 'ChickensOnly' not found
Mean and Median Values of the chicken stocks
Error in view(ChickenSummary): object 'ChickenSummary' not found
GroupByArea <- data %>%
  group_by(`Area`) %>%
  summarise(mean_stock_value = mean(Value, na.rm=TRUE),
            median_stock_value = median(Value, na.rm=TRUE),
            stock_value_sd = sd(Value, na.rm=TRUE),
            min_stock_value = min(Value, na.rm=TRUE),
            max_stock_value = max(Value, na.rm=TRUE),
            first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
            third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
print(GroupByArea)
# A tibble: 248 × 8
Area mean_st…¹ media…² stock…³ min_s…⁴ max_s…⁵ first…⁶ third…⁷
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 8099. 6700 2.82e3 4700 14414 6222. 10565
2 Africa 196561. 12910. 4.36e5 1213 1897326 5926. 25395
3 Albania 2278. 1300 2.27e3 200 9494 520 3752
4 Algeria 17621. 42.5 3.88e4 10 136078 24 2072
5 American Samoa 41.4 38 1.40e1 24 80 35 40
6 Americas 856356. 66924. 1.54e6 553 5796289 7497 552649
7 Angola 9453. 6075 8.93e3 3400 36500 4925 6930
8 Antigua and Barbuda 93.6 85 3.97e1 43 160 60.2 130
9 Argentina 18844. 2355 3.36e4 75 118300 530 10350
10 Armenia 2062. 1528. 2.04e3 120 8934 209. 3865
# … with 238 more rows, and abbreviated variable names ¹mean_stock_value,
# ²median_stock_value, ³stock_value_sd, ⁴min_stock_value, ⁵max_stock_value,
# ⁶first_quartile_stock_value, ⁷third_quartile_stock_value
view(GroupByArea)
GroupByYear <- data %>%
  group_by(`Year`) %>%
  summarise(mean_stock_value = mean(Value, na.rm=TRUE),
            median_stock_value = median(Value, na.rm=TRUE),
            stock_value_sd = sd(Value, na.rm=TRUE),
            min_stock_value = min(Value, na.rm=TRUE),
            max_stock_value = max(Value, na.rm=TRUE),
            first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
            third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
print(GroupByYear)
# A tibble: 58 × 8
Year mean_stock_value median_stock…¹ stock…² min_s…³ max_s…⁴ first…⁵ third…⁶
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1961 36752. 1033 216931. 1 3906690 102. 6223.
2 1962 37787. 1014 224935. 1 4048728 105. 6441.
3 1963 38736. 1106 230985. 1 4163131 106. 6530.
4 1964 39325. 1103 234108. 0 4231221 106. 6818
5 1965 40334. 1104 240537. 0 4349674 106 7000
6 1966 41229. 1088. 245576. 0 4445629 109. 7463.
7 1967 43240. 1193 257592. 0 4666511 119 7935.
8 1968 44420. 1252. 265750. 0 4823170 117. 7915
9 1969 45607. 1267 273871. 0 4988438 106 8255
10 1970 47706. 1259 285751. 0 5209733 119 8676.
# … with 48 more rows, and abbreviated variable names ¹median_stock_value,
# ²stock_value_sd, ³min_stock_value, ⁴max_stock_value,
# ⁵first_quartile_stock_value, ⁶third_quartile_stock_value
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
The goal was to analyze the sizes of the five livestock categories in the dataset and to determine the poultry quantities in different countries between 1961 and 2018. Grouping the data by year, I observed a notable disparity between chickens and the other types of poultry: countries held substantially more chickens than any other variety. Grouping by year also makes it possible to follow trends in chicken stocks over time. The ChickenSummary table reports the mean, median, and mode of chicken stocks for each year, and the GroupByYear table adds the standard deviation, minimum, maximum, and quartiles of stock values across all bird types. Together, these tables show how the central tendency and dispersion of stocks have changed throughout the period.
---
title: "Challenge 2_PriyankaThatikonda"
author: "Meredith Rolfe"
description: "Data wrangling: using group_by() and summarise()"
date: "08/16/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - faostat
  - hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
```{r}
# show_col_types is an argument of read_csv(), not print()
data <- read_csv("_data/birds.csv", show_col_types = FALSE)
print(data)
head(data)
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
```{r}
#| label: summary
summary(data)
dim(data)
colnames(data)
nrow(data)
# Count the distinct bird categories (Item) in the dataset
num_unique_items <- length(unique(data$Item))
message(sprintf("There are %d unique varieties of animals present in the dataset\n", num_unique_items))
cat(num_unique_items, "\n")
# Count the distinct areas; stored under its own name instead of reusing num_unique_items
num_unique_areas <- length(unique(data$Area))
message("Total number of areas present in the dataset:\n", num_unique_areas)
cat(num_unique_areas, "\n")
```
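The counts above can also be collected in a single table. The following is only a minimal sketch on a made-up toy data frame (the `toy` object and its values are assumptions for illustration, not rows from `birds.csv`); it counts the distinct values in several columns at once and measures how much of `Value` is missing, which matters because `summary(data)` reports NA's in that column.

```r
library(dplyr)

# Toy stand-in for the birds data; all values are invented for illustration.
toy <- data.frame(
  Area  = c("A", "A", "B", "B", "C"),
  Item  = c("Chickens", "Ducks", "Chickens", "Chickens", "Turkeys"),
  Year  = c(1961, 1961, 1961, 1962, 1962),
  Value = c(10, 2, 30, NA, 5)
)

# One row holding the number of distinct values in each identifying column.
distinct_counts <- toy %>%
  summarise(across(c(Area, Item, Year), n_distinct))

# Share of rows where Value is missing.
missing_share <- mean(is.na(toy$Value))

print(distinct_counts)
print(missing_share)
```

Applied to the real data, the same `summarise(across(...))` call would replace the separate `length(unique(...))` lines, assuming the column names are unchanged.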
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
```{r}
data %>%
  select(`Area`) %>%
  n_distinct()
cat("The above shows the number of distinct areas\n")
data %>%
  select(`Year`) %>%
  n_distinct()
cat("The above shows the number of distinct years\n")
library(dplyr)
BirdsGrouped <- data %>%
  group_by(Year)
print("Data grouped by year:")
view(BirdsGrouped)
ChickenOnly <- BirdsGrouped %>%
  filter(Item == "Chickens")
cat("Data filtered to include only chickens:\n")
view(ChickenOnly)
# Summarise chicken stocks per year (ChickenOnly is still grouped by Year)
ChickenSummary <- ChickenOnly %>%
  summarise(
    mean_stocks = mean(Value, na.rm = TRUE),
    median_stocks = median(Value, na.rm = TRUE),
    # Mode: most frequent non-missing value, looked up among the unique values
    mode_stocks = {
      u <- unique(Value[!is.na(Value)])
      u[which.max(tabulate(match(Value, u)))]
    })
cat("Mean and Median Values of the chicken stocks\n")
view(ChickenSummary)
GroupByArea <- data %>%
  group_by(`Area`) %>%
  summarise(mean_stock_value = mean(Value, na.rm=TRUE),
            median_stock_value = median(Value, na.rm=TRUE),
            stock_value_sd = sd(Value, na.rm=TRUE),
            min_stock_value = min(Value, na.rm=TRUE),
            max_stock_value = max(Value, na.rm=TRUE),
            first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
            third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
print(GroupByArea)
view(GroupByArea)
GroupByYear <- data %>%
  group_by(`Year`) %>%
  summarise(mean_stock_value = mean(Value, na.rm=TRUE),
            median_stock_value = median(Value, na.rm=TRUE),
            stock_value_sd = sd(Value, na.rm=TRUE),
            min_stock_value = min(Value, na.rm=TRUE),
            max_stock_value = max(Value, na.rm=TRUE),
            first_quartile_stock_value = quantile(Value, 0.25, na.rm=TRUE),
            third_quartile_stock_value = quantile(Value, 0.75, na.rm=TRUE))
print(GroupByYear)
view(GroupByYear)
```
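Base R has no built-in function for the statistical mode, so computing it per group needs a small helper. The sketch below is one possible version, not the only way to do it: `stat_mode()` is a hypothetical helper name, and `toy` is an invented data frame rather than the real birds data. The helper tabulates the non-missing values and returns the most frequent one, taking the result from the vector of unique values so that the index from `which.max()` points at the right element.

```r
library(dplyr)

# Hypothetical helper: most frequent non-missing value (first one wins on ties).
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy chicken stocks; all values are invented for illustration.
toy <- data.frame(
  Year  = c(1961, 1961, 1961, 1962, 1962, 1962),
  Item  = "Chickens",
  Value = c(5, 7, 7, NA, 3, 3)
)

# Central tendency of chicken stocks per year.
chicken_summary <- toy %>%
  filter(Item == "Chickens") %>%
  group_by(Year) %>%
  summarise(
    mean_stocks   = mean(Value, na.rm = TRUE),
    median_stocks = median(Value, na.rm = TRUE),
    mode_stocks   = stat_mode(Value)
  )

print(chicken_summary)
```

For a continuous measure such as stock counts, the mode is often not very informative because most values occur only once; it is included here because the challenge asks for it.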
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
The goal was to analyze the sizes of the five livestock categories in the dataset and to determine the poultry quantities in different countries between 1961 and 2018. Grouping the data by year, I observed a notable disparity between chickens and the other types of poultry: countries held substantially more chickens than any other variety. Grouping by year also makes it possible to follow trends in chicken stocks over time. The `ChickenSummary` table reports the mean, median, and mode of chicken stocks for each year, and the `GroupByYear` table adds the standard deviation, minimum, maximum, and quartiles of stock values across all bird types. Together, these tables show how the central tendency and dispersion of stocks have changed throughout the period.
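A comparison between chickens and the other bird types can be made explicit by grouping on both `Item` and `Year`. The following is a rough sketch on an invented `toy` data frame (its values and the name `by_item_year` are assumptions, not results from `birds.csv`); it totals stocks per item and year and sorts each year so the largest category comes first.

```r
library(dplyr)

# Toy data with two areas, two bird types, and two years; values are invented.
toy <- data.frame(
  Area  = rep(c("A", "B"), each = 4),
  Item  = rep(c("Chickens", "Ducks"), times = 4),
  Year  = rep(c(1961, 1961, 1962, 1962), times = 2),
  Value = c(100, 10, 120, 12, 200, NA, 220, 20)
)

# Total and median stock per bird type and year, largest category first.
by_item_year <- toy %>%
  group_by(Item, Year) %>%
  summarise(
    total_stock  = sum(Value, na.rm = TRUE),
    median_stock = median(Value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(Year, desc(total_stock))

print(by_item_year)
```

On the real data, `Area` contains regional entries such as "Africa" and "Americas" alongside individual countries, so totals would double count unless those rows are filtered out first; how to identify them reliably is left open here.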