library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
Harsha Kanaka Eswar Gudipudi
May 15, 2023
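Today’s challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics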
Read in one (or more) of the following data sets, available in the posts/_data
folder, using the correct R package and command.
# A tibble: 6 × 14
`Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QA Live Anima… 2 Afgh… 5112 Stocks 1057
2 QA Live Anima… 2 Afgh… 5112 Stocks 1057
3 QA Live Anima… 2 Afgh… 5112 Stocks 1057
4 QA Live Anima… 2 Afgh… 5112 Stocks 1057
5 QA Live Anima… 2 Afgh… 5112 Stocks 1057
6 QA Live Anima… 2 Afgh… 5112 Stocks 1057
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
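Description: This data set presents information on live bird stocks (chickens, ducks, geese, etc.) in different areas such as Afghanistan, Albania, and Algeria, for the years 1961 to 2018. Each of the 30,977 rows records the stock of one bird type in one area in one year, described by 14 variables.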
Domain Code Domain Area Code Area
Length:30977 Length:30977 Min. : 1 Length:30977
Class :character Class :character 1st Qu.: 79 Class :character
Mode :character Mode :character Median : 156 Mode :character
Mean :1202
3rd Qu.: 231
Max. :5504
Element Code Element Item Code Item
Min. :5112 Length:30977 Min. :1057 Length:30977
1st Qu.:5112 Class :character 1st Qu.:1057 Class :character
Median :5112 Mode :character Median :1068 Mode :character
Mean :5112 Mean :1066
3rd Qu.:5112 3rd Qu.:1072
Max. :5112 Max. :1083
Year Code Year Unit Value
Min. :1961 Min. :1961 Length:30977 Min. : 0
1st Qu.:1976 1st Qu.:1976 Class :character 1st Qu.: 171
Median :1992 Median :1992 Mode :character Median : 1800
Mean :1991 Mean :1991 Mean : 99411
3rd Qu.:2005 3rd Qu.:2005 3rd Qu.: 15404
Max. :2018 Max. :2018 Max. :23707134
NA's :1036
Flag Flag Description
Length:30977 Length:30977
Class :character Class :character
Mode :character Mode :character
[1] 30977 14
All varieties of live stock in the data:
There are 5 unique varieties of animals present in the dataset
Chickens, Ducks, Geese and guinea fowls, Turkeys, Pigeons, other birds
The data is collected from various areas like:
Total no of areas: 248
Afghanistan, Albania, Algeria, American Samoa, Angola, Antigua and Barbuda, Argentina, Armenia, Aruba, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belgium-Luxembourg, Belize, Benin, Bermuda, Bhutan, Bolivia (Plurinational State of), Bosnia and Herzegovina, Botswana, Brazil, Brunei Darussalam, Bulgaria, Burkina Faso, Burundi, Cabo Verde, Cambodia, Cameroon, Canada, Cayman Islands, Central African Republic, Chad, Chile, China, Hong Kong SAR, China, Macao SAR, China, mainland, China, Taiwan Province of, Colombia, Comoros, Congo, Cook Islands, Costa Rica, Côte d'Ivoire, Croatia, Cuba, Cyprus, Czechia, Czechoslovakia, Democratic People's Republic of Korea, Democratic Republic of the Congo, Denmark, Dominica, Dominican Republic, Ecuador, Egypt, El Salvador, Equatorial Guinea, Eritrea, Estonia, Eswatini, Ethiopia, Ethiopia PDR, Falkland Islands (Malvinas), Fiji, Finland, France, French Guyana, French Polynesia, Gabon, Gambia, Georgia, Germany, Ghana, Greece, Grenada, Guadeloupe, Guam, Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Honduras, Hungary, Iceland, India, Indonesia, Iran (Islamic Republic of), Iraq, Ireland, Israel, Italy, Jamaica, Japan, Jordan, Kazakhstan, Kenya, Kiribati, Kuwait, Kyrgyzstan, Lao People's Democratic Republic, Latvia, Lebanon, Lesotho, Liberia, Libya, Liechtenstein, Lithuania, Luxembourg, Madagascar, Malawi, Malaysia, Mali, Malta, Martinique, Mauritania, Mauritius, Mexico, Micronesia (Federated States of), Mongolia, Montenegro, Montserrat, Morocco, Mozambique, Myanmar, Namibia, Nauru, Nepal, Netherlands, Netherlands Antilles (former), New Caledonia, New Zealand, Nicaragua, Niger, Nigeria, Niue, North Macedonia, Norway, Oman, Pacific Islands Trust Territory, Pakistan, Palestine, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Puerto Rico, Qatar, Republic of Korea, Republic of Moldova, Réunion, Romania, Russian Federation, Rwanda, Saint Helena, Ascension and Tristan da 
Cunha, Saint Kitts and Nevis, Saint Lucia, Saint Pierre and Miquelon, Saint Vincent and the Grenadines, Samoa, Sao Tome and Principe, Saudi Arabia, Senegal, Serbia, Serbia and Montenegro, Seychelles, Sierra Leone, Singapore, Slovakia, Slovenia, Solomon Islands, Somalia, South Africa, South Sudan, Spain, Sri Lanka, Sudan, Sudan (former), Suriname, Sweden, Switzerland, Syrian Arab Republic, Tajikistan, Thailand, Timor-Leste, Togo, Tokelau, Tonga, Trinidad and Tobago, Tunisia, Turkey, Turkmenistan, Tuvalu, Uganda, Ukraine, United Arab Emirates, United Kingdom of Great Britain and Northern Ireland, United Republic of Tanzania, United States of America, United States Virgin Islands, Uruguay, USSR, Uzbekistan, Vanuatu, Venezuela (Bolivarian Republic of), Viet Nam, Wallis and Futuna Islands, Yemen, Yugoslav SFR, Zambia, Zimbabwe, World, Africa, Eastern Africa, Middle Africa, Northern Africa, Southern Africa, Western Africa, Americas, Northern America, Central America, Caribbean, South America, Asia, Central Asia, Eastern Asia, Southern Asia, South-eastern Asia, Western Asia, Europe, Eastern Europe, Northern Europe, Southern Europe, Western Europe, Oceania, Australia and New Zealand, Melanesia, Micronesia, Polynesia
Conduct some exploratory data analysis, using dplyr commands such as group_by(), select(), filter(), and summarise(). Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
[1] "Data grouped by year:"
# A tibble: 30,977 × 14
# Groups: Year [58]
`Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QA Live Anim… 2 Afgh… 5112 Stocks 1057
2 QA Live Anim… 2 Afgh… 5112 Stocks 1057
3 QA Live Anim… 2 Afgh… 5112 Stocks 1057
4 QA Live Anim… 2 Afgh… 5112 Stocks 1057
5 QA Live Anim… 2 Afgh… 5112 Stocks 1057
6 QA Live Anim… 2 Afgh… 5112 Stocks 1057
7 QA Live Anim… 2 Afgh… 5112 Stocks 1057
8 QA Live Anim… 2 Afgh… 5112 Stocks 1057
9 QA Live Anim… 2 Afgh… 5112 Stocks 1057
10 QA Live Anim… 2 Afgh… 5112 Stocks 1057
# ℹ 30,967 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
[1] "Data filtered to include only chickens:"
# A tibble: 13,074 × 14
# Groups: Year [58]
`Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 QA Live Anim… 2 Afgh… 5112 Stocks 1057
2 QA Live Anim… 2 Afgh… 5112 Stocks 1057
3 QA Live Anim… 2 Afgh… 5112 Stocks 1057
4 QA Live Anim… 2 Afgh… 5112 Stocks 1057
5 QA Live Anim… 2 Afgh… 5112 Stocks 1057
6 QA Live Anim… 2 Afgh… 5112 Stocks 1057
7 QA Live Anim… 2 Afgh… 5112 Stocks 1057
8 QA Live Anim… 2 Afgh… 5112 Stocks 1057
9 QA Live Anim… 2 Afgh… 5112 Stocks 1057
10 QA Live Anim… 2 Afgh… 5112 Stocks 1057
# ℹ 13,064 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
# Value <dbl>, Flag <chr>, `Flag Description` <chr>
# Calculate the mean, median, and mode of chicken stocks for each year
chicken_summary <- chickens_only %>%
  summarise(mean_stocks = mean(Value, na.rm = TRUE),
            median_stocks = median(Value, na.rm = TRUE),
            mode_stocks = as.numeric(names(sort(-table(Value))))[1])
print("Summary statistics of chicken stocks:")
[1] "Summary statistics of chicken stocks:"
# A tibble: 58 × 4
Year mean_stocks median_stocks mode_stocks
<dbl> <dbl> <dbl> <dbl>
1 1961 74060. 4184 1400
2 1962 76753. 4300 10
3 1963 78922. 4500 2
4 1964 80213. 4600 10
5 1965 82458. 4930 10
6 1966 83880. 5208. 45
7 1967 88047. 5056 50
8 1968 91003. 5250 60
9 1969 94121. 6000 40
10 1970 98297. 6070. 90
# ℹ 48 more rows
# Calculate the minimum, maximum, and quantiles of chicken stocks for each year
chicken_range <- chickens_only %>%
  summarise(min_stocks = min(Value, na.rm = TRUE),
            max_stocks = max(Value, na.rm = TRUE),
            q1_stocks = quantile(Value, 0.25, na.rm = TRUE),
            q3_stocks = quantile(Value, 0.75, na.rm = TRUE))
print("Range of chicken stocks:")
[1] "Range of chicken stocks:"
# A tibble: 58 × 5
Year min_stocks max_stocks q1_stocks q3_stocks
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1961 1 3906690 314. 22016
2 1962 1 4048728 326. 22298
3 1963 2 4163131 330 25000
4 1964 2 4231221 372 25305
5 1965 2 4349674 392. 26000
6 1966 2 4445629 377. 24091.
7 1967 2 4666511 401. 26213.
8 1968 2 4823170 428. 25729.
9 1969 2 4988438 450 26237
10 1970 2 5209733 488. 28497.
# ℹ 48 more rows
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
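Here, I chose to group the data by year to see whether there are any trends in chicken stocks over time. The chicken_summary and chicken_range tables describe the central tendency and dispersion of chicken stocks for each year. In the first ten years shown, the mean rises every year, from about 74,060 in 1961 to about 98,297 in 1970, and the median rises from 4,184 to about 6,070. In every displayed year the mean is far larger than the median, and the maximum dwarfs the third quartile (for example, 3,906,690 versus 22,016 in 1961), which points to a strongly right-skewed distribution. One likely reason is that Area includes aggregate regions such as World alongside individual countries. The mode is less informative here: it jumps between small values such as 2, 10, and 1,400, probably because few exact stock values repeat within a year.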
---
title: "Challenge 2"
author: "Harsha Kanaka Eswar Gudipudi"
description: "Data wrangling: using group_by() and summarise()"
date: "05/15/2023"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - Harsha Kanaka Eswar Gudipudi
  - birds.csv
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Challenge Overview
Today's challenge is to
1) read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
2) provide summary statistics for different interesting groups within the data, and interpret those statistics
## Read in the Data
Read in one (or more) of the following data sets, available in the `posts/_data` folder, using the correct R package and command.
- railroad\*.csv or StateCounty2012.xls ⭐
- FAOstat\*.csv or birds.csv ⭐⭐⭐
- hotel_bookings.csv ⭐⭐⭐⭐
```{r}
df <- read_csv('_data/birds.csv', show_col_types = FALSE)
head(df)
```
Add any comments or documentation as needed. More challenging data may require additional code chunks and documentation.
## Describe the data
Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).
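Description: This data set presents information on live bird stocks (chickens, ducks, geese, etc.) in different areas such as Afghanistan, Albania, and Algeria, for the years 1961 to 2018. Each of the 30,977 rows records the stock of one bird type in one area in one year, described by 14 variables.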
```{r}
#| label: summary
summary(df)
dim(df)
```
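The `summary(df)` output reports missing entries in `Value`. A minimal sketch of how those could be counted per column and per bird type is shown here; the `toy` tibble is a made-up stand-in for `df` that only borrows its column names, so the numbers are illustrative and not results from birds.csv.

```r
library(dplyr)

# Made-up stand-in for df; only the column names match birds.csv.
toy <- tibble::tibble(
  Item  = c("Chickens", "Ducks", "Chickens", "Turkeys"),
  Year  = c(1961, 1961, 1962, 1962),
  Value = c(10, NA, 12, NA)
)

# Number of missing values in every column.
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Missing Value entries broken down by bird type.
na_by_item <- toy %>%
  group_by(Item) %>%
  summarise(n_missing = sum(is.na(Value)), n_rows = n())

print(na_counts)
print(na_by_item)
```

Replacing `toy` with `df` should show whether the missing values cluster in particular bird types.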
All varieties of live stock in the data:
```{r}
unique_items <- unique(df$Item)
num_unique_items <- length(unique_items)
# End each message with a newline so the two outputs do not run together
cat("There are", num_unique_items, "unique varieties of animals present in the dataset\n")
cat(paste(unique_items, collapse = ", "), "\n")
```
The data is collected from various areas like:
```{r}
# Area mixes individual countries with aggregate regions (e.g. World, Africa)
unique_areas <- unique(df$Area)
num_unique_areas <- length(unique_areas)
cat("Total no of areas:", num_unique_areas, "\n")
cat(paste(unique_areas, collapse = ", "), "\n")
```
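The area list ends with aggregate regions (World, Africa, Eastern Africa, and so on), so summaries over all 248 areas mix countries with regional totals. The sketch below shows one way the two could be separated. It rests on an assumption the post does not confirm: that aggregate regions carry `Area Code` values of 5000 and above, which the `Area Code` summary (maximum 5504) hints at. The `toy` tibble and its values are made up for illustration.

```r
library(dplyr)

# Made-up stand-in for df; codes and values are illustrative only.
toy <- tibble::tibble(
  Area        = c("Afghanistan", "Albania", "World", "Africa"),
  `Area Code` = c(2, 3, 5000, 5100),
  Value       = c(10, 20, 300, 100)
)

# ASSUMPTION: codes of 5000 and above denote aggregate regions, not countries.
aggregate_cutoff <- 5000

countries_only <- toy %>% filter(`Area Code` < aggregate_cutoff)
aggregates     <- toy %>% filter(`Area Code` >= aggregate_cutoff)

n_distinct(countries_only$Area)
n_distinct(aggregates$Area)
```

Before relying on the cutoff, it is worth checking it against the real data, for instance by listing `distinct(df, Area, `Area Code`)` sorted by code.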
## Provide Grouped Summary Statistics
Conduct some exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`. Find the central tendency (mean, median, mode) and dispersion (standard deviation, min/max/quantile) for different subgroups within the data set.
```{r}
library(dplyr)

birds_grouped <- df %>%
  group_by(Year)
print("Data grouped by year:")
print(birds_grouped)

# Filter the data to include only chickens
chickens_only <- birds_grouped %>%
  filter(Item == "Chickens")
print("Data filtered to include only chickens:")
print(chickens_only)

# Calculate the mean, median, and mode of chicken stocks for each year
chicken_summary <- chickens_only %>%
  summarise(mean_stocks = mean(Value, na.rm = TRUE),
            median_stocks = median(Value, na.rm = TRUE),
            mode_stocks = as.numeric(names(sort(-table(Value))))[1])
print("Summary statistics of chicken stocks:")
print(chicken_summary)

# Calculate the minimum, maximum, and quantiles of chicken stocks for each year
chicken_range <- chickens_only %>%
  summarise(min_stocks = min(Value, na.rm = TRUE),
            max_stocks = max(Value, na.rm = TRUE),
            q1_stocks = quantile(Value, 0.25, na.rm = TRUE),
            q3_stocks = quantile(Value, 0.75, na.rm = TRUE))
print("Range of chicken stocks:")
print(chicken_range)
```
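The prompt also asks for the standard deviation, which the chunk above does not compute. The following is a minimal sketch of how it could be added, together with a small mode helper. `stat_mode()` is a hypothetical helper written for this example, not a base R or dplyr function, and the `toy` tibble is made-up data that only mirrors the column names of `df`.

```r
library(dplyr)

# Hypothetical helper: most frequent non-missing value.
# Ties are resolved in favour of the smallest value.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Made-up stand-in for df.
toy <- tibble::tibble(
  Year  = c(1961, 1961, 1961, 1962, 1962, 1962),
  Item  = "Chickens",
  Value = c(10, 10, 40, 5, NA, 15)
)

toy_summary <- toy %>%
  filter(Item == "Chickens") %>%
  group_by(Year) %>%
  summarise(
    mean_stocks = mean(Value, na.rm = TRUE),
    sd_stocks   = sd(Value, na.rm = TRUE),
    iqr_stocks  = IQR(Value, na.rm = TRUE),
    mode_stocks = stat_mode(Value)
  )

print(toy_summary)
```

Swapping `toy` for `df` would give the per-year standard deviation and interquartile range alongside the existing mean.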
### Explain and Interpret
Be sure to explain why you choose a specific group. Comment on the interpretation of any interesting differences between groups that you uncover. This section can be integrated with the exploratory data analysis, just be sure it is included.
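Here, I chose to group the data by year to see whether there are any trends in chicken stocks over time. The chicken_summary and chicken_range tables describe the central tendency and dispersion of chicken stocks for each year. In the first ten years shown, the mean rises every year, from about 74,060 in 1961 to about 98,297 in 1970, and the median rises from 4,184 to about 6,070. In every displayed year the mean is far larger than the median, and the maximum dwarfs the third quartile (for example, 3,906,690 versus 22,016 in 1961), which points to a strongly right-skewed distribution. One likely reason is that Area includes aggregate regions such as World alongside individual countries. The mode is less informative here: it jumps between small values such as 2, 10, and 1,400, probably because few exact stock values repeat within a year.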
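To put rough numbers on the change over time, the arithmetic below uses the 1961 and 1970 rows of the chicken_summary table printed earlier in the post. The figures are copied from that rounded printout, so the results are approximate; the variable names are illustrative.

```r
# Values copied from the printed chicken_summary rows for 1961 and 1970.
mean_1961   <- 74060
median_1961 <- 4184
mean_1970   <- 98297
median_1970 <- 6070

# Percentage growth of the mean and the median between 1961 and 1970.
mean_growth_pct   <- (mean_1970 / mean_1961 - 1) * 100
median_growth_pct <- (median_1970 / median_1961 - 1) * 100

# Ratio of mean to median in 1961; values well above 1 suggest right skew.
mean_to_median_1961 <- mean_1961 / median_1961

round(c(mean_growth_pct, median_growth_pct, mean_to_median_1961), 1)
```

On these printed figures the mean grows by roughly a third and the median by a bit under half, while the 1961 mean is more than seventeen times the median.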