challenge_2
railroads
faostat
hotel_bookings
Data wrangling: using group() and summarise()
Author

Noah Dixon

Published

June 5, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Part 1: Read in Data

Using the read_csv function we can load the birds.csv data from the file.

Code
birds_from_csv <- read_csv("_data/birds.csv")

Part 2: Describe the data

In order to get a better sense of what the data looks like, we can print the first 6 rows of the data using the head function.

Code
head(birds_from_csv)

From the above output we can see that there are several redundant columns which all contain the word “Code”. We can remove these columns to make the data more tidy.

Code
birds_without_redundancy <- birds_from_csv %>%
  select(-contains("Code"))

Now, we can check the remaining columns to see how many unique values each column has. We will use the sapply function with the length and unique functions to generate the number of unique entries in each columns

Code
print(sapply(birds_without_redundancy, function(x) length(unique(x))))
          Domain             Area          Element             Item 
               1              248                1                5 
            Year             Unit            Value             Flag 
              58                1            11496                6 
Flag Description 
               6 

We can see that the Domain, Element, and Unit columns all contain only 1 unique value across all rows. This means they essentially provide no valuable information for a given case, and can be removed from the data set. Additionally, from the head() output above we know that the Flag Description column correlates exactly to the Flag column. Therefore, we can remove it as well since it doesn’t provide additional information. We will print the results to see the new data.

Code
birds_tidy <- birds_without_redundancy %>%
  select(Area, Item, Year, Value, Flag)
birds_tidy

With the new tidy data, we still need to remove cases with the flag “A”, as this indicates that the area is not a country but a larger region. Therefore, these cases do not belong in the same data set as the cases for single countries.

Code
birds_tidy_final <- birds_tidy %>%
  filter(Flag != "A" | is.na(Flag))
dim(birds_tidy_final)
[1] 24489     5

We can see that the dimension of the new birds_tidy_final data frame is smaller than the birds_tidy data frame, indicating that we have removed rows where the flag is equal to “A”. Now we have tidied data, where each case represents the number of birds purchased in a single country for a given year.

Part 3: Provide Grouped Summary Statistics

First, lets generate some interesting groups for this data. We can use the Area, Year, and Item columns to group the data

Code
birds_by_country <- birds_tidy_final %>%
  group_by(Area)
  
birds_by_year <- birds_tidy_final %>%
  group_by(Year)
  
birds_by_type <- birds_tidy_final %>%
  group_by(Item)

Now, lets examine the created groups by generating statistics for each group. We will start with the cases grouped by country.

Code
country_group_stats <- summarize(birds_by_country,
                                 mean = mean(Value, na.rm = TRUE),
                                 median = median(Value, na.rm = TRUE),
                                 min = min(Value, na.rm = TRUE),
                                 max = max(Value, na.rm = TRUE),
                                 sd = sd(Value, na.rm = TRUE),
                                 var = var(Value, na.rm = TRUE),
                                 IQR = IQR(Value, na.rm = TRUE))
arrange(country_group_stats, desc(mean))

Let’s also examine cases grouped by year.

Code
year_group_stats <- summarize(birds_by_year,
                              mean = mean(Value, na.rm = TRUE),
                              median = median(Value, na.rm = TRUE),
                              min = min(Value, na.rm = TRUE),
                              max = max(Value, na.rm = TRUE),
                              sd = sd(Value, na.rm = TRUE),
                              var = var(Value, na.rm = TRUE),
                              IQR = IQR(Value, na.rm = TRUE))
arrange(year_group_stats, desc(mean))

And finally lets examine cases grouped by bird type.

Code
type_group_stats <- summarize(birds_by_type,
                              mean = mean(Value, na.rm = TRUE),
                              median = median(Value, na.rm = TRUE),
                              min = min(Value, na.rm = TRUE),
                              max = max(Value, na.rm = TRUE),
                              sd = sd(Value, na.rm = TRUE),
                              var = var(Value, na.rm = TRUE),
                              IQR = IQR(Value, na.rm = TRUE))
arrange(type_group_stats, desc(mean))

Part 4: Explain and Interpret

Grouping the data based on country, year, and bird type yields some interesting results because these are the 3 main categorical variables represented by the data set. As we can see from the generated statistics, China, the US, and the USSR had the highest median and average birds sold over the the time period. Also, we can see that the average number of birds sold per year worldwide increased steadily over the time period covered by the data (1961 - 2018). Finally, we can see that chickens were by far the highest bird sold over the time frame, with a maximum value of 5,274,475 * 1000 heads sold in one year.