library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Mariia Dubyk
October 5, 2022
Error in `rename()`:
! Can't rename columns that don't exist.
✖ Column `п.їDomain.Code` doesn't exist.
Error in `birds[, c("Domain", "Area", "Element", "Item", "Year", "Unit", "Value", "Flag",
"Flag.Description")]`:
! Can't subset columns that don't exist.
✖ Column `Flag.Description` doesn't exist.
Domain Code Domain Area Code Area
Length:30977 Length:30977 Min. : 1 Length:30977
Class :character Class :character 1st Qu.: 79 Class :character
Mode :character Mode :character Median : 156 Mode :character
Mean :1202
3rd Qu.: 231
Max. :5504
Element Code Element Item Code Item
Min. :5112 Length:30977 Min. :1057 Length:30977
1st Qu.:5112 Class :character 1st Qu.:1057 Class :character
Median :5112 Mode :character Median :1068 Mode :character
Mean :5112 Mean :1066
3rd Qu.:5112 3rd Qu.:1072
Max. :5112 Max. :1083
Year Code Year Unit Value
Min. :1961 Min. :1961 Length:30977 Min. : 0
1st Qu.:1976 1st Qu.:1976 Class :character 1st Qu.: 171
Median :1992 Median :1992 Mode :character Median : 1800
Mean :1991 Mean :1991 Mean : 99411
3rd Qu.:2005 3rd Qu.:2005 3rd Qu.: 15404
Max. :2018 Max. :2018 Max. :23707134
NA's :1036
Flag Flag Description
Length:30977 Length:30977
Class :character Class :character
Mode :character Mode :character
Data Frame Summary
birds
Dimensions: 30977 x 14
Duplicates: 0
----------------------------------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
---- ------------------ -------------------------------- ----------------------- ---------------------- ---------- ---------
1 Domain Code 1. QA 30977 (100.0%) IIIIIIIIIIIIIIIIIIII 30977 0
[character] (100.0%) (0.0%)
2 Domain 1. Live Animals 30977 (100.0%) IIIIIIIIIIIIIIIIIIII 30977 0
[character] (100.0%) (0.0%)
3 Area Code Mean (sd) : 1201.7 (2099.4) 248 distinct values : 30977 0
[numeric] min < med < max: : (100.0%) (0.0%)
1 < 156 < 5504 :
IQR (CV) : 152 (1.7) : .
: :
4 Area 1. Africa 290 ( 0.9%) 30977 0
[character] 2. Asia 290 ( 0.9%) (100.0%) (0.0%)
3. Eastern Asia 290 ( 0.9%)
4. Egypt 290 ( 0.9%)
5. Europe 290 ( 0.9%)
6. France 290 ( 0.9%)
7. Greece 290 ( 0.9%)
8. Myanmar 290 ( 0.9%)
9. Northern Africa 290 ( 0.9%)
10. South-eastern Asia 290 ( 0.9%)
[ 238 others ] 28077 (90.6%) IIIIIIIIIIIIIIIIII
5 Element Code 1 distinct value 5112 : 30977 (100.0%) IIIIIIIIIIIIIIIIIIII 30977 0
[numeric] (100.0%) (0.0%)
6 Element 1. Stocks 30977 (100.0%) IIIIIIIIIIIIIIIIIIII 30977 0
[character] (100.0%) (0.0%)
7 Item Code Mean (sd) : 1066.5 (9) 1057 : 13074 (42.2%) IIIIIIII 30977 0
[numeric] min < med < max: 1068 : 6909 (22.3%) IIII (100.0%) (0.0%)
1057 < 1068 < 1083 1072 : 4136 (13.4%) II
IQR (CV) : 15 (0) 1079 : 5693 (18.4%) III
1083 : 1165 ( 3.8%)
8 Item 1. Chickens 13074 (42.2%) IIIIIIII 30977 0
[character] 2. Ducks 6909 (22.3%) IIII (100.0%) (0.0%)
3. Geese and guinea fowls 4136 (13.4%) II
4. Pigeons, other birds 1165 ( 3.8%)
5. Turkeys 5693 (18.4%) III
9 Year Code Mean (sd) : 1990.6 (16.7) 58 distinct values . . . . : : : : 30977 0
[numeric] min < med < max: : : : . : : : : : : (100.0%) (0.0%)
1961 < 1992 < 2018 : : : : : : : : : :
IQR (CV) : 29 (0) : : : : : : : : : :
: : : : : : : : : :
10 Year Mean (sd) : 1990.6 (16.7) 58 distinct values . . . . : : : : 30977 0
[numeric] min < med < max: : : : . : : : : : : (100.0%) (0.0%)
1961 < 1992 < 2018 : : : : : : : : : :
IQR (CV) : 29 (0) : : : : : : : : : :
: : : : : : : : : :
11 Unit 1. 1000 Head 30977 (100.0%) IIIIIIIIIIIIIIIIIIII 30977 0
[character] (100.0%) (0.0%)
12 Value Mean (sd) : 99410.6 (720611.4) 11495 distinct values : 29941 1036
[numeric] min < med < max: : (96.7%) (3.3%)
0 < 1800 < 23707134 :
IQR (CV) : 15233 (7.2) :
:
13 Flag 1. * 1494 ( 7.4%) I 20204 10773
[character] 2. A 6488 (32.1%) IIIIII (65.2%) (34.8%)
3. F 10007 (49.5%) IIIIIIIII
4. Im 1213 ( 6.0%) I
5. M 1002 ( 5.0%)
14 Flag Description 1. Aggregate, may include of 6488 (20.9%) IIII 30977 0
[character] 2. Data not available 1002 ( 3.2%) (100.0%) (0.0%)
3. FAO data based on imputat 1213 ( 3.9%)
4. FAO estimate 10007 (32.3%) IIIIII
5. Official data 10773 (34.8%) IIIIII
6. Unofficial figure 1494 ( 4.8%)
----------------------------------------------------------------------------------------------------------------------------
# A tibble: 30,977 × 14
Domain Cod…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 QA Live … 5000 World 5112 Stocks 1057 Chic… 2018 2018
2 QA Live … 5000 World 5112 Stocks 1057 Chic… 2017 2017
3 QA Live … 5000 World 5112 Stocks 1057 Chic… 2016 2016
4 QA Live … 5000 World 5112 Stocks 1057 Chic… 2015 2015
5 QA Live … 5000 World 5112 Stocks 1057 Chic… 2014 2014
6 QA Live … 5000 World 5112 Stocks 1057 Chic… 2013 2013
7 QA Live … 5000 World 5112 Stocks 1057 Chic… 2012 2012
8 QA Live … 5000 World 5112 Stocks 1057 Chic… 2010 2010
9 QA Live … 5000 World 5112 Stocks 1057 Chic… 2011 2011
10 QA Live … 5000 World 5112 Stocks 1057 Chic… 2009 2009
# … with 30,967 more rows, 4 more variables: Unit <chr>, Value <dbl>,
# Flag <chr>, `Flag Description` <chr>, and abbreviated variable names
# ¹`Domain Code`, ²`Area Code`, ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
The data set reports live-animal stocks for 5 types of birds (Chickens; Ducks; Geese and guinea fowls; Pigeons, other birds; Turkeys) in different geographic areas (countries, continents, and the world) from 1961 to 2018. It contains 30,977 rows, so we have 30,977 cases. Each case is identified by a bird type (‘Item’), a geographic area (‘Area’), and a year (‘Year’). The columns ‘Unit’ and ‘Value’ give the size of the stock: ‘Unit’ is always 1000 Head, and ‘Value’ ranges from 0 to 23,707,134 (the maximum is a World total), with 1,036 values missing. The columns ‘Flag’ and ‘Flag Description’ record how each value was obtained, for example official data, an FAO estimate, or an aggregate.
# Keep every flag except "A" (aggregates); official data have a missing Flag.
birds <- birds %>%
  filter(Flag %in% c("M", "Im", "F", "*") | is.na(Flag)) %>%
  filter(Item == "Chickens")
birds %>%
  group_by(Year) %>%
  summarise(
    Median = median(Value, na.rm = TRUE),
    Mean = mean(Value, na.rm = TRUE),
    SD = sd(Value, na.rm = TRUE),
    Min = min(Value, na.rm = TRUE),
    Max = max(Value, na.rm = TRUE)
  )
# A tibble: 58 × 6
Year Median Mean SD Min Max
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1961 2000 31861. 120751. 2 530000
2 1962 1525 13065. 32937. 35 170000
3 1963 2650 39482. 161735. 10 780000
4 1964 2500 8315. 22758. 42 106500
5 1965 1095 3110 4125. 2 12000
6 1966 2600 41534. 130255. 3 587000
7 1967 970 8777. 26745. 27 138000
8 1968 4100 7884. 10644. 10 44500
9 1969 4250 58425. 208698. 75 912700
10 1970 575 3367. 5163. 2 16700
# … with 48 more rows
# A tibble: 290 × 2
# Groups: Year [58]
Year Quantile
<dbl> <dbl>
1 1961 2
2 1961 182.
3 1961 2000
4 1961 6400
5 1961 530000
6 1962 35
7 1962 165
8 1962 1525
9 1962 5150
10 1962 170000
# … with 280 more rows
# A tibble: 181 × 6
Area Median Mean SD Min Max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 6300 6180 719. 5000 6800
2 Albania 2800 2800 NA 2800 2800
3 Algeria 101500 90265. 45927. 9100 135018
4 American Samoa 39 42.8 15.9 25 80
5 Angola 5700 5471. 1155. 3600 6940
6 Antigua and Barbuda 90 92.8 34.6 50 150
7 Argentina 82500 79125 21006. 38000 105000
8 Armenia 3835 3835 49.5 3800 3870
9 Aruba NA NaN NA Inf -Inf
10 Australia 21145 21145 1209. 20290 22000
# … with 171 more rows
# A tibble: 905 × 2
# Groups: Area [181]
Area Quantile
<chr> <dbl>
1 Afghanistan 5000
2 Afghanistan 6100
3 Afghanistan 6300
4 Afghanistan 6700
5 Afghanistan 6800
6 Albania 2800
7 Albania 2800
8 Albania 2800
9 Albania 2800
10 Albania 2800
# … with 895 more rows
The first grouping is by year. The first table reports the median, mean, standard deviation, minimum, and maximum of chicken stocks for each year, and the second table reports the quantiles. The third and fourth tables report the same measures of central tendency and dispersion for the data grouped by area.
The summary statistics by year show how the central tendency of chicken stocks changed over the observed period. Because only one type of bird is kept, the comparison is across years rather than across birds. The summary statistics by area show how central tendency and dispersion differ among countries.
---
title: "Challenge 2"
author: "Mariia Dubyk"
description: "Data wrangling: using group_by() and summarise()"
date: "10/05/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - challenge_2
  - railroads
  - faostat
  - hotel_bookings
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Data
```{r}
birds<-read_csv("_data/birds.csv")
```
```{r}
#| label: summary
# read_csv() keeps the header names as written, so they contain spaces, not dots.
birds <- rename(birds, Domain.Code = `Domain Code`)
birds <- birds[, c("Domain", "Area", "Element", "Item", "Year", "Unit", "Value", "Flag", "Flag Description")]
summary(birds)
library(summarytools)
print(dfSummary(birds))
view(dfSummary(birds))
arrange(birds, desc(Value))
```
## Describe the data
The data set reports live-animal stocks for 5 types of birds (Chickens; Ducks; Geese and guinea fowls; Pigeons, other birds; Turkeys) in different geographic areas (countries, continents, and the world) from 1961 to 2018. It contains 30,977 rows, so we have 30,977 cases. Each case is identified by a bird type ('Item'), a geographic area ('Area'), and a year ('Year'). The columns 'Unit' and 'Value' give the size of the stock: 'Unit' is always 1000 Head, and 'Value' ranges from 0 to 23,707,134 (the maximum is a World total), with 1,036 values missing. The columns 'Flag' and 'Flag Description' record how each value was obtained, for example official data, an FAO estimate, or an aggregate.
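The claim that each row is one case (one bird type in one area in one year) can be checked directly. The following is a minimal sketch on a made-up `toy` tibble that only borrows the column names of the birds data; the values are hypothetical and the object names `n_cases` and `overview` are introduced here for illustration.

```r
library(dplyr)

# Toy stand-in for the birds data (hypothetical values, same column names)
toy <- tibble(
  Area  = c("France", "France", "Egypt", "Egypt"),
  Item  = c("Chickens", "Ducks", "Chickens", "Chickens"),
  Year  = c(1961, 1961, 1961, 1962),
  Value = c(100, 20, 50, NA)
)

# If every Area-Item-Year combination occurs once, each row is one case
n_cases <- nrow(distinct(toy, Area, Item, Year))

# Quick overview: number of bird types, covered years, missing values
overview <- toy %>%
  summarise(
    n_items    = n_distinct(Item),
    first_year = min(Year),
    last_year  = max(Year),
    n_missing  = sum(is.na(Value))
  )

n_cases == nrow(toy)
overview
```

Running the same two steps on `birds` would confirm whether the 30,977 rows are all distinct cases.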
```{r}
# Keep every flag except "A" (aggregates); official data have a missing Flag.
birds <- birds %>%
  filter(Flag %in% c("M", "Im", "F", "*") | is.na(Flag)) %>%
  filter(Item == "Chickens")
birds %>%
  group_by(Year) %>%
  summarise(
    Median = median(Value, na.rm = TRUE),
    Mean = mean(Value, na.rm = TRUE),
    SD = sd(Value, na.rm = TRUE),
    Min = min(Value, na.rm = TRUE),
    Max = max(Value, na.rm = TRUE)
  )
```
```{r}
birds%>%
group_by(Year)%>%
select("Value")%>%
summarise(Quantile = quantile (Value, na.rm = TRUE))
```
```{r}
birds%>%
group_by(Area)%>%
summarise(Median = median(Value, na.rm = TRUE), Mean = mean(Value, na.rm = TRUE), SD = sd (Value, na.rm = TRUE), Min = min (Value, na.rm = TRUE), Max = max (Value, na.rm = TRUE))
```
```{r}
birds%>%
group_by(Area)%>%
select("Value")%>%
summarise(Quantile = quantile (Value, na.rm = TRUE))
```
## Provide Grouped Summary Statistics
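- First, I filtered the data to keep only chickens, so the summaries cover chicken stocks in different areas from 1961 to 2018.
- I also filtered on 'Flag' to drop aggregate rows (continents and other regional totals), leaving only countries.
- I then grouped the data in two ways: by year and by area.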
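Filtering on several flag values at once is easy to get subtly wrong. The sketch below uses a made-up `toy` tibble with hypothetical values to contrast `==` against a vector, which compares element by element and recycles the vector along the column, with `%in%`, which tests membership. It also assumes that rows without a flag are stored as `NA`, which is how `read_csv()` reads empty cells by default.

```r
library(dplyr)

# Toy data: one aggregate row ("A"), one row with a missing flag
toy <- tibble(
  Area  = c("World", "France", "Egypt", "Greece",
            "Aruba", "Albania", "Angola", "Armenia"),
  Flag  = c("A", "F", NA, "Im", "M", "F", "*", "F"),
  Value = c(1000, 10, 20, 30, NA, 40, 50, 60)
)

keep <- c("M", "Im", "F", "*")

# `==` pairs row 1 with "M", row 2 with "Im", and so on, then recycles
recycled <- toy %>% filter(Flag == keep)

# `%in%` asks, for each row, whether its flag is one of the kept values;
# is.na() keeps the rows that have no flag at all
membership <- toy %>% filter(Flag %in% keep | is.na(Flag))

nrow(recycled)
nrow(membership)
```

On this toy input the recycled comparison keeps a single row, while the membership test keeps every row except the aggregate.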
The first grouping is by year. The first table reports the median, mean, standard deviation, minimum, and maximum of chicken stocks for each year, and the second table reports the quantiles. The third and fourth tables report the same measures of central tendency and dispersion for the data grouped by area.
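A quantile table with one unlabeled row per quantile can be hard to read. As an alternative sketch, each quantile can be requested as its own named column, so that every group stays on one row. The `toy` tibble and its values are hypothetical, and the column names `Q1` and `Q3` are choices made here, not part of the original analysis.

```r
library(dplyr)

# Toy data: two years, one missing value
toy <- tibble(
  Year  = c(1961, 1961, 1961, 1962, 1962, 1962),
  Value = c(2, 2000, 530000, 35, 1525, NA)
)

by_year <- toy %>%
  group_by(Year) %>%
  summarise(
    Min    = min(Value, na.rm = TRUE),
    # unname() drops the "25%" / "75%" labels that quantile() attaches
    Q1     = unname(quantile(Value, 0.25, na.rm = TRUE)),
    Median = median(Value, na.rm = TRUE),
    Q3     = unname(quantile(Value, 0.75, na.rm = TRUE)),
    Max    = max(Value, na.rm = TRUE),
    .groups = "drop"
  )

by_year
```

The same pattern works with `group_by(Area)` for the second grouping.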
### Explain and Interpret
The summary statistics by year show how the central tendency of chicken stocks changed over the observed period. Because only one type of bird is kept, the comparison is across years rather than across birds. The summary statistics by area show how central tendency and dispersion differ among countries.
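When comparing dispersion among countries, it helps to know how many valid observations each group has: a standard deviation is `NA` for a single observation, and a group with no valid values yields `NaN`, `Inf`, and `-Inf`. The sketch below, on a hypothetical `toy` tibble, counts valid observations per area; the threshold of two observations is an assumption chosen for illustration.

```r
library(dplyr)

# Toy data: one area with two values, one with a single value, one all-missing
toy <- tibble(
  Area  = c("Afghanistan", "Afghanistan", "Albania", "Aruba"),
  Value = c(5000, 6800, 2800, NA)
)

by_area <- toy %>%
  filter(!is.na(Value)) %>%   # drop missing values before grouping
  group_by(Area) %>%
  summarise(
    N    = n(),               # number of valid observations
    Mean = mean(Value),
    SD   = sd(Value),         # NA when N is 1
    .groups = "drop"
  )

# Keep only areas where a standard deviation can be computed
reliable <- by_area %>% filter(N >= 2)

by_area
reliable
```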