Challenge 2

challenge_2

Harsha Kanaka Eswar Gudipudi

birds.csv

Data wrangling: using group_by() and summarise()

Author

Harsha Kanaka Eswar Gudipudi

Published

May 15, 2023

Code

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

Read in one (or more) of the following data sets, available in the posts/_data folder, using the correct R package and command.

railroad*.csv or StateCounty2012.xls ⭐
FAOstat*.csv or birds.csv ⭐⭐⭐
hotel_bookings.csv ⭐⭐⭐⭐

Code

df <- read_csv('_data/birds.csv', show_col_types = FALSE)
head(df)

# A tibble: 6 × 14
  `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
  <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
1 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
2 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
3 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
4 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
5 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
6 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>

Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

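Description: This dataset reports live bird stocks (chickens, ducks, geese and guinea fowls, turkeys, and pigeons and other birds) for areas such as Afghanistan, Albania, and Algeria, covering the years 1961 to 2018. Each row appears to record one stock value for a given area, bird type, and year. The Domain and Element columns (Live Animals, Stocks) suggest the figures likely come from FAOSTAT, and the Flag and Flag Description columns indicate how each value was obtained.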

Code

summary(df)

 Domain Code           Domain            Area Code        Area          
 Length:30977       Length:30977       Min.   :   1   Length:30977      
 Class :character   Class :character   1st Qu.:  79   Class :character  
 Mode  :character   Mode  :character   Median : 156   Mode  :character  
                                       Mean   :1202                     
                                       3rd Qu.: 231                     
                                       Max.   :5504                     
                                                                        
  Element Code    Element            Item Code        Item          
 Min.   :5112   Length:30977       Min.   :1057   Length:30977      
 1st Qu.:5112   Class :character   1st Qu.:1057   Class :character  
 Median :5112   Mode  :character   Median :1068   Mode  :character  
 Mean   :5112                      Mean   :1066                     
 3rd Qu.:5112                      3rd Qu.:1072                     
 Max.   :5112                      Max.   :1083                     
                                                                    
   Year Code         Year          Unit               Value         
 Min.   :1961   Min.   :1961   Length:30977       Min.   :       0  
 1st Qu.:1976   1st Qu.:1976   Class :character   1st Qu.:     171  
 Median :1992   Median :1992   Mode  :character   Median :    1800  
 Mean   :1991   Mean   :1991                      Mean   :   99411  
 3rd Qu.:2005   3rd Qu.:2005                      3rd Qu.:   15404  
 Max.   :2018   Max.   :2018                      Max.   :23707134  
                                                  NA's   :1036      
     Flag           Flag Description  
 Length:30977       Length:30977      
 Class :character   Class :character  
 Mode  :character   Mode  :character

Code

dim(df)

[1] 30977    14
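The summary() output above reports 1036 missing entries in Value. A minimal sketch of how the missing values could be broken down by bird type follows. It runs on a small invented tibble (`toy`, with made-up values) rather than on birds.csv, so the numbers it prints are illustrative only.

```r
library(dplyr)

# Toy stand-in for df; the values are invented for illustration only.
toy <- tibble(
  Item  = c("Chickens", "Chickens", "Ducks", "Ducks", "Turkeys"),
  Value = c(4700, NA, 120, 135, NA)
)

# Count and share of missing stock values per bird type.
missing_by_item <- toy %>%
  group_by(Item) %>%
  summarise(n_rows        = n(),
            n_missing     = sum(is.na(Value)),
            share_missing = mean(is.na(Value)))

print(missing_by_item)
```

Replacing `toy` with `df` should give the same breakdown for the real data, assuming the Item and Value columns are named as shown in the output above.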

Varieties of live bird stocks in the data:

Code

unique_items <- unique(df$Item)
num_unique_items <- length(unique_items)
cat(paste("There are", num_unique_items, "unique varieties of birds in the dataset"))

There are 5 unique varieties of birds in the dataset

Code

# Separate names with semicolons because some item names contain commas
cat(paste(unique_items, collapse = "; "))

Chickens; Ducks; Geese and guinea fowls; Turkeys; Pigeons, other birds

The data cover the following areas, which include individual countries as well as regional aggregates (e.g., World, Africa, Europe):

Code

unique_areas <- unique(df$Area)
num_unique_areas <- length(unique_areas)
cat(paste("Total no of areas: ", num_unique_areas))

Total no of areas:  248

Code

cat(paste(unique_areas, collapse = ", "))

Afghanistan, Albania, Algeria, American Samoa, Angola, Antigua and Barbuda, Argentina, Armenia, Aruba, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belgium-Luxembourg, Belize, Benin, Bermuda, Bhutan, Bolivia (Plurinational State of), Bosnia and Herzegovina, Botswana, Brazil, Brunei Darussalam, Bulgaria, Burkina Faso, Burundi, Cabo Verde, Cambodia, Cameroon, Canada, Cayman Islands, Central African Republic, Chad, Chile, China, Hong Kong SAR, China, Macao SAR, China, mainland, China, Taiwan Province of, Colombia, Comoros, Congo, Cook Islands, Costa Rica, Côte d'Ivoire, Croatia, Cuba, Cyprus, Czechia, Czechoslovakia, Democratic People's Republic of Korea, Democratic Republic of the Congo, Denmark, Dominica, Dominican Republic, Ecuador, Egypt, El Salvador, Equatorial Guinea, Eritrea, Estonia, Eswatini, Ethiopia, Ethiopia PDR, Falkland Islands (Malvinas), Fiji, Finland, France, French Guyana, French Polynesia, Gabon, Gambia, Georgia, Germany, Ghana, Greece, Grenada, Guadeloupe, Guam, Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Honduras, Hungary, Iceland, India, Indonesia, Iran (Islamic Republic of), Iraq, Ireland, Israel, Italy, Jamaica, Japan, Jordan, Kazakhstan, Kenya, Kiribati, Kuwait, Kyrgyzstan, Lao People's Democratic Republic, Latvia, Lebanon, Lesotho, Liberia, Libya, Liechtenstein, Lithuania, Luxembourg, Madagascar, Malawi, Malaysia, Mali, Malta, Martinique, Mauritania, Mauritius, Mexico, Micronesia (Federated States of), Mongolia, Montenegro, Montserrat, Morocco, Mozambique, Myanmar, Namibia, Nauru, Nepal, Netherlands, Netherlands Antilles (former), New Caledonia, New Zealand, Nicaragua, Niger, Nigeria, Niue, North Macedonia, Norway, Oman, Pacific Islands Trust Territory, Pakistan, Palestine, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Puerto Rico, Qatar, Republic of Korea, Republic of Moldova, Réunion, Romania, Russian Federation, Rwanda, Saint Helena, Ascension and Tristan da Cunha, Saint Kitts and Nevis, Saint Lucia, Saint Pierre and Miquelon, Saint Vincent and the Grenadines, Samoa, Sao Tome and Principe, Saudi Arabia, Senegal, Serbia, Serbia and Montenegro, Seychelles, Sierra Leone, Singapore, Slovakia, Slovenia, Solomon Islands, Somalia, South Africa, South Sudan, Spain, Sri Lanka, Sudan, Sudan (former), Suriname, Sweden, Switzerland, Syrian Arab Republic, Tajikistan, Thailand, Timor-Leste, Togo, Tokelau, Tonga, Trinidad and Tobago, Tunisia, Turkey, Turkmenistan, Tuvalu, Uganda, Ukraine, United Arab Emirates, United Kingdom of Great Britain and Northern Ireland, United Republic of Tanzania, United States of America, United States Virgin Islands, Uruguay, USSR, Uzbekistan, Vanuatu, Venezuela (Bolivarian Republic of), Viet Nam, Wallis and Futuna Islands, Yemen, Yugoslav SFR, Zambia, Zimbabwe, World, Africa, Eastern Africa, Middle Africa, Northern Africa, Southern Africa, Western Africa, Americas, Northern America, Central America, Caribbean, South America, Asia, Central Asia, Eastern Asia, Southern Asia, South-eastern Asia, Western Asia, Europe, Eastern Europe, Northern Europe, Southern Europe, Western Europe, Oceania, Australia and New Zealand, Melanesia, Micronesia, Polynesia
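The list of areas ends with aggregates such as World, Africa, and Europe, which would double-count the individual countries in any grouped summary. One possible way to separate them is sketched below. It assumes that aggregate areas carry an Area Code of 5000 or higher, which is consistent with the summary() output (third quartile 231, maximum 5504) but is not confirmed by the data shown here. The tibble `toy` and all codes and values in it other than Afghanistan's code of 2 are hypothetical.

```r
library(dplyr)

# Toy stand-in for df. Only Afghanistan's Area Code (2) is taken from the
# printed data; the other codes and all values are illustrative assumptions.
toy <- tibble(
  Area        = c("Afghanistan", "Albania", "World", "Africa"),
  `Area Code` = c(2, 3, 5000, 5100),
  Value       = c(100, 50, 1000, 300)
)

# Assumption: aggregate regions have Area Code >= 5000.
countries_only  <- toy %>% filter(`Area Code` < 5000)
aggregates_only <- toy %>% filter(`Area Code` >= 5000)

print(countries_only)
print(aggregates_only)
```

Before applying such a cutoff to the real data, it would be worth checking it with something like `df %>% distinct(Area, `Area Code`) %>% arrange(desc(`Area Code`))`.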

Provide Grouped Summary Statistics

Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.

Code

library(dplyr)

birds_grouped <- df %>%
  group_by(Year)
  
print("Data grouped by year:")

[1] "Data grouped by year:"

Code

print(birds_grouped)

# A tibble: 30,977 × 14
# Groups:   Year [58]
   `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
   <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
 1 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 2 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 3 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 4 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 5 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 6 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 7 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 8 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 9 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
10 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
# ℹ 30,967 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>

Code

# Filter the data to include only chickens
chickens_only <- birds_grouped %>%
  filter(Item == "Chickens")
  
print("Data filtered to include only chickens:")

[1] "Data filtered to include only chickens:"

Code

print(chickens_only)

# A tibble: 13,074 × 14
# Groups:   Year [58]
   `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
   <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
 1 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 2 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 3 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 4 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 5 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 6 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 7 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 8 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 9 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
10 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
# ℹ 13,064 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>

Code

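# Calculate the mean, median, and mode of chicken stocks for each year.
# The mode is the most frequent Value: table() counts each distinct value,
# sort(-...) puts the highest count first, and [1] picks that value.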
chicken_summary <- chickens_only %>%
  summarise(mean_stocks = mean(Value, na.rm = TRUE),
            median_stocks = median(Value, na.rm = TRUE),
            mode_stocks = as.numeric(names(sort(-table(Value))))[1])

print("Summary statistics of chicken stocks:")

[1] "Summary statistics of chicken stocks:"

Code

print(chicken_summary)

# A tibble: 58 × 4
    Year mean_stocks median_stocks mode_stocks
   <dbl>       <dbl>         <dbl>       <dbl>
 1  1961      74060.         4184         1400
 2  1962      76753.         4300           10
 3  1963      78922.         4500            2
 4  1964      80213.         4600           10
 5  1965      82458.         4930           10
 6  1966      83880.         5208.          45
 7  1967      88047.         5056           50
 8  1968      91003.         5250           60
 9  1969      94121.         6000           40
10  1970      98297.         6070.          90
# ℹ 48 more rows
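The challenge prompt also asks for the standard deviation, which the summary above does not include. A minimal sketch of per-year dispersion measures is shown here on a small invented tibble (`toy`, made-up values), using base R's sd() and IQR() inside summarise().

```r
library(dplyr)

# Toy stand-in for chickens_only; the values are invented for illustration.
toy <- tibble(
  Year  = c(1961, 1961, 1961, 1962, 1962, 1962),
  Value = c(10, 20, NA, 5, 15, 25)
)

# Standard deviation, interquartile range, and missing count per year.
toy_spread <- toy %>%
  group_by(Year) %>%
  summarise(sd_stocks  = sd(Value, na.rm = TRUE),
            iqr_stocks = IQR(Value, na.rm = TRUE),
            n_missing  = sum(is.na(Value)))

print(toy_spread)
```

The same summarise() call could be applied to chickens_only, since it is already grouped by Year.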

Code

# Calculate the minimum, maximum, and quantiles of chicken stocks for each year
chicken_range <- chickens_only %>%
  summarise(min_stocks = min(Value, na.rm = TRUE),
            max_stocks = max(Value, na.rm = TRUE),
            q1_stocks = quantile(Value, 0.25, na.rm = TRUE),
            q3_stocks = quantile(Value, 0.75, na.rm = TRUE))

print("Range of chicken stocks:")

[1] "Range of chicken stocks:"

Code

print(chicken_range)

# A tibble: 58 × 5
    Year min_stocks max_stocks q1_stocks q3_stocks
   <dbl>      <dbl>      <dbl>     <dbl>     <dbl>
 1  1961          1    3906690      314.    22016 
 2  1962          1    4048728      326.    22298 
 3  1963          2    4163131      330     25000 
 4  1964          2    4231221      372     25305 
 5  1965          2    4349674      392.    26000 
 6  1966          2    4445629      377.    24091.
 7  1967          2    4666511      401.    26213.
 8  1968          2    4823170      428.    25729.
 9  1969          2    4988438      450     26237 
10  1970          2    5209733      488.    28497.
# ℹ 48 more rows
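The summaries above cover chickens only. To compare subgroups, the same statistics could be computed per bird type instead of per year. The sketch below uses an invented tibble (`toy`) with made-up values, so it shows the shape of the result rather than real figures.

```r
library(dplyr)

# Toy stand-in for df; the values are invented for illustration only.
toy <- tibble(
  Item  = c("Chickens", "Chickens", "Chickens", "Ducks", "Ducks", "Turkeys"),
  Year  = c(1961, 1962, 1963, 1961, 1962, 1961),
  Value = c(4000, 4300, 4500, 100, NA, 30)
)

# Central tendency per bird type, largest median first.
item_summary <- toy %>%
  group_by(Item) %>%
  summarise(n_obs         = sum(!is.na(Value)),
            mean_stocks   = mean(Value, na.rm = TRUE),
            median_stocks = median(Value, na.rm = TRUE)) %>%
  arrange(desc(median_stocks))

print(item_summary)
```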

Explain and Interpret

Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.

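I chose to group the data by year to see whether chicken stocks show a trend over time. The chicken_summary and chicken_range tables describe the central tendency and dispersion of chicken stocks in each year. In the rows shown, the mean rises steadily from about 74,060 in 1961 to about 98,297 in 1970, and the median rises from 4,184 to about 6,070. The mean is more than ten times the median in every year shown, which points to a strongly right-skewed distribution: a few very large values (the yearly maximum is in the millions, while the third quartile stays below 30,000) pull the mean up. Because the Area column includes aggregates such as World and the continents alongside individual countries, these large values are likely aggregate rows, so the median and quartiles are probably the more representative summaries. The mode is less informative here because stock counts are close to continuous and few values repeat.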
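As a rough sketch of how the highest and lowest yearly means and the year-to-year trend could be extracted, the code below rebuilds the first ten rows of chicken_summary from the rounded values printed earlier (1961 to 1970 only). The name `chicken_means` is introduced here for illustration; on the real data the same steps would apply to chicken_summary directly.

```r
library(dplyr)

# Rounded mean_stocks values copied from the printed chicken_summary output
# (first ten years only; the remaining 48 rows are not shown in the post).
chicken_means <- tibble(
  Year        = 1961:1970,
  mean_stocks = c(74060, 76753, 78922, 80213, 82458,
                  83880, 88047, 91003, 94121, 98297)
)

# Years with the highest and lowest mean stocks.
highest <- chicken_means %>% slice_max(mean_stocks, n = 1)
lowest  <- chicken_means %>% slice_min(mean_stocks, n = 1)

# Year-over-year percentage change in the mean (NA for the first year).
chicken_trend <- chicken_means %>%
  arrange(Year) %>%
  mutate(pct_change = 100 * (mean_stocks / lag(mean_stocks) - 1))

print(highest)
print(lowest)
print(chicken_trend)
```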