challenge_1
my name
dataset
Reading in data and creating a post
Author

Abhinav Reddy Yadatha

Published

February 26, 2023

Code
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a dataset, and

  2. describe the dataset using both words and any supporting information (e.g., tables)

Read in the Data

Read in one (or more) of the following data sets, using the correct R package and command.

  • railroad_2012_clean_county.csv ⭐
  • birds.csv ⭐⭐
  • FAOstat*.csv ⭐⭐
  • wild_bird_data.xlsx ⭐⭐⭐
  • StateCounty2012.xls ⭐⭐⭐⭐

Find the _data folder, located inside the posts folder. Then you can read in the data, using either one of the readr standard tidy read commands, or a specialized package such as readxl.
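As a minimal sketch of the two reading routes mentioned above, the snippet below writes a tiny made-up CSV to a temporary file and reads it back with `readr::read_csv()`. The file contents and the names `toy_path` and `toy` are hypothetical and are not part of the challenge data. The commented line shows how an Excel file could be read with `readxl::read_excel()`, assuming the same `_data` folder.

```r
library(readr)

# Hypothetical toy data, written to a temporary file so the example is self-contained.
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("Area,Item,Year,Value",
    "Area A,Chickens,1961,10",
    "Area A,Chickens,1962,12"),
  toy_path
)

# read_csv() guesses column types; show_col_types = FALSE silences the type message.
toy <- read_csv(toy_path, show_col_types = FALSE)
toy

# For the Excel files, readxl would be used instead (path assumed):
# wild_birds <- readxl::read_excel("_data/wild_bird_data.xlsx")
```

Excel files sometimes carry title rows above the header, in which case the `skip` argument of `read_excel()` may be needed.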

Code
dataframe <- read_csv('_data/birds.csv', show_col_types = FALSE)
head(dataframe)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1961  1961 1000…
2 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1962  1962 1000…
3 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1963  1963 1000…
4 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1964  1964 1000…
5 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1965  1965 1000…
6 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1966  1966 1000…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`

Add any comments or documentation as needed. More challenging data sets may require additional code chunks and documentation.

Describe the data

Using a combination of words and results of R commands, can you provide a high level description of the data? Describe as efficiently as possible where/how the data was (likely) gathered, indicate the cases and variables (both the interpretation and any details you deem useful to the reader to fully understand your chosen data).

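Description: The dataset ‘birds.csv’ contains annual stock counts (in units of 1000 head) for five categories of live birds (chickens, ducks, geese and guinea fowls, turkeys, and pigeons/other birds) across 248 areas, from 1961 to 2018. Each row (case) is one area, bird category, and year. The flag descriptions mention FAO estimates and official data, which suggests the data were compiled by the FAO from a mix of official national figures and its own estimates.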

Code
# Display the first few rows.
head(dataframe)
# A tibble: 6 × 14
  Domai…¹ Domain Area …² Area  Eleme…³ Element Item …⁴ Item  Year …⁵  Year Unit 
  <chr>   <chr>    <dbl> <chr>   <dbl> <chr>     <dbl> <chr>   <dbl> <dbl> <chr>
1 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1961  1961 1000…
2 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1962  1962 1000…
3 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1963  1963 1000…
4 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1964  1964 1000…
5 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1965  1965 1000…
6 QA      Live …       2 Afgh…    5112 Stocks     1057 Chic…    1966  1966 1000…
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
#   and abbreviated variable names ¹​`Domain Code`, ²​`Area Code`,
#   ³​`Element Code`, ⁴​`Item Code`, ⁵​`Year Code`

Displaying the summary of the dataset.

Code
library(summarytools)
dfSummary(dataframe)
Data Frame Summary  
dataframe  
Dimensions: 30977 x 14  
Duplicates: 0  

----------------------------------------------------------------------------------------------------------------------------
No   Variable           Stats / Values                   Freqs (% of Valid)      Graph                  Valid      Missing  
---- ------------------ -------------------------------- ----------------------- ---------------------- ---------- ---------
1    Domain Code        1. QA                            30977 (100.0%)          IIIIIIIIIIIIIIIIIIII   30977      0        
     [character]                                                                                        (100.0%)   (0.0%)   

2    Domain             1. Live Animals                  30977 (100.0%)          IIIIIIIIIIIIIIIIIIII   30977      0        
     [character]                                                                                        (100.0%)   (0.0%)   

3    Area Code          Mean (sd) : 1201.7 (2099.4)      248 distinct values     :                      30977      0        
     [numeric]          min < med < max:                                         :                      (100.0%)   (0.0%)   
                        1 < 156 < 5504                                           :                                          
                        IQR (CV) : 152 (1.7)                                     :                 .                        
                                                                                 :                 :                        

4    Area               1. Africa                          290 ( 0.9%)                                  30977      0        
     [character]        2. Asia                            290 ( 0.9%)                                  (100.0%)   (0.0%)   
                        3. Eastern Asia                    290 ( 0.9%)                                                      
                        4. Egypt                           290 ( 0.9%)                                                      
                        5. Europe                          290 ( 0.9%)                                                      
                        6. France                          290 ( 0.9%)                                                      
                        7. Greece                          290 ( 0.9%)                                                      
                        8. Myanmar                         290 ( 0.9%)                                                      
                        9. Northern Africa                 290 ( 0.9%)                                                      
                        10. South-eastern Asia             290 ( 0.9%)                                                      
                        [ 238 others ]                   28077 (90.6%)           IIIIIIIIIIIIIIIIII                         

5    Element Code       1 distinct value                 5112 : 30977 (100.0%)   IIIIIIIIIIIIIIIIIIII   30977      0        
     [numeric]                                                                                          (100.0%)   (0.0%)   

6    Element            1. Stocks                        30977 (100.0%)          IIIIIIIIIIIIIIIIIIII   30977      0        
     [character]                                                                                        (100.0%)   (0.0%)   

7    Item Code          Mean (sd) : 1066.5 (9)           1057 : 13074 (42.2%)    IIIIIIII               30977      0        
     [numeric]          min < med < max:                 1068 :  6909 (22.3%)    IIII                   (100.0%)   (0.0%)   
                        1057 < 1068 < 1083               1072 :  4136 (13.4%)    II                                         
                        IQR (CV) : 15 (0)                1079 :  5693 (18.4%)    III                                        
                                                         1083 :  1165 ( 3.8%)                                               

8    Item               1. Chickens                      13074 (42.2%)           IIIIIIII               30977      0        
     [character]        2. Ducks                          6909 (22.3%)           IIII                   (100.0%)   (0.0%)   
                        3. Geese and guinea fowls         4136 (13.4%)           II                                         
                        4. Pigeons, other birds           1165 ( 3.8%)                                                      
                        5. Turkeys                        5693 (18.4%)           III                                        

9    Year Code          Mean (sd) : 1990.6 (16.7)        58 distinct values      . . .   . :   : : :    30977      0        
     [numeric]          min < med < max:                                         : : : . : : : : : :    (100.0%)   (0.0%)   
                        1961 < 1992 < 2018                                       : : : : : : : : : :                        
                        IQR (CV) : 29 (0)                                        : : : : : : : : : :                        
                                                                                 : : : : : : : : : :                        

10   Year               Mean (sd) : 1990.6 (16.7)        58 distinct values      . . .   . :   : : :    30977      0        
     [numeric]          min < med < max:                                         : : : . : : : : : :    (100.0%)   (0.0%)   
                        1961 < 1992 < 2018                                       : : : : : : : : : :                        
                        IQR (CV) : 29 (0)                                        : : : : : : : : : :                        
                                                                                 : : : : : : : : : :                        

11   Unit               1. 1000 Head                     30977 (100.0%)          IIIIIIIIIIIIIIIIIIII   30977      0        
     [character]                                                                                        (100.0%)   (0.0%)   

12   Value              Mean (sd) : 99410.6 (720611.4)   11495 distinct values   :                      29941      1036     
     [numeric]          min < med < max:                                         :                      (96.7%)    (3.3%)   
                        0 < 1800 < 23707134                                      :                                          
                        IQR (CV) : 15233 (7.2)                                   :                                          
                                                                                 :                                          

13   Flag               1. *                              1494 ( 7.4%)           I                      20204      10773    
     [character]        2. A                              6488 (32.1%)           IIIIII                 (65.2%)    (34.8%)  
                        3. F                             10007 (49.5%)           IIIIIIIII                                  
                        4. Im                             1213 ( 6.0%)           I                                          
                        5. M                              1002 ( 5.0%)                                                      

14   Flag Description   1. Aggregate, may include of      6488 (20.9%)           IIII                   30977      0        
     [character]        2. Data not available             1002 ( 3.2%)                                  (100.0%)   (0.0%)   
                        3. FAO data based on imputat      1213 ( 3.9%)                                                      
                        4. FAO estimate                  10007 (32.3%)           IIIIII                                     
                        5. Official data                 10773 (34.8%)           IIIIII                                     
                        6. Unofficial figure              1494 ( 4.8%)                                                      
----------------------------------------------------------------------------------------------------------------------------
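The summary shows 10,773 missing `Flag` values and exactly 10,773 rows described as "Official data", which suggests that a blank flag marks official data. One way to check this is to count each pairing of flag and description. The sketch below uses a small hypothetical tibble, `toy`, as a stand-in for `dataframe`; only the column names and labels are taken from the summary above.

```r
library(dplyr)

# Hypothetical stand-in for `dataframe`, keeping only the two flag columns.
toy <- tibble(
  Flag = c(NA, NA, "F", "F", "M", "*"),
  `Flag Description` = c("Official data", "Official data", "FAO estimate",
                         "FAO estimate", "Data not available", "Unofficial figure")
)

# One row per (Flag, Flag Description) pairing, with its frequency in n.
flag_pairs <- toy %>% count(Flag, `Flag Description`)
flag_pairs
```

If running the same `count()` call on `dataframe` returns one description per flag, the two columns are redundant and the missing `Flag` values are not true gaps in the data.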

Checking the dimensions of the dataset:

Code
dim(dataframe)
[1] 30977    14

The dataset has 30,977 rows and 14 columns.

Displaying the column names of the dataset:

Code
colnames(dataframe)
 [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
 [5] "Element Code"     "Element"          "Item Code"        "Item"            
 [9] "Year Code"        "Year"             "Unit"             "Value"           
[13] "Flag"             "Flag Description"
The dataset has 14 columns, listed above. Several come in code/label pairs, such as "Area Code" and "Area".

Number of unique years:

Code
unique_years <- dataframe %>% select(Year) %>% n_distinct()
unique_years
[1] 58

The dataset covers 58 unique years, i.e., 1961 to 2018.
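Since 2018 - 1961 + 1 = 58, a count of 58 distinct years is consistent with no year missing from the range. A short sketch of how that could be checked directly, along with whether `Year` and `Year Code` carry the same values, is shown below. The tibble `toy` is a hypothetical stand-in for `dataframe` and deliberately skips a year to show a failing case.

```r
library(dplyr)

# Hypothetical stand-in for `dataframe`; 1963 is left out on purpose.
toy <- tibble(
  `Year Code` = c(1961, 1962, 1964),
  Year        = c(1961, 1962, 1964)
)

year_check <- toy %>%
  summarise(
    first_year   = min(Year),
    last_year    = max(Year),
    n_years      = n_distinct(Year),
    # TRUE only if every year between the first and last is present
    no_gaps      = n_distinct(Year) == max(Year) - min(Year) + 1,
    # TRUE only if the two year columns hold identical values row by row
    same_as_code = all(Year == `Year Code`)
  )
year_check
```

Here `no_gaps` comes out `FALSE` because of the skipped year; on the real data the same check would be expected to return `TRUE`.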

Number of unique bird types:

Code
unique_birds <- dataframe %>% select(Item) %>% n_distinct()
unique_birds
[1] 5

It can be observed that the dataset contains information about 5 different types of birds.

Number of unique Areas / countries:

Code
unique_areas <- dataframe %>% select(Area) %>% n_distinct()
unique_areas
[1] 248

The dataset contains information about 248 different areas. As the summary above shows, these include regional aggregates such as Africa, Asia, and Europe as well as individual countries.
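The three distinct counts above can also be computed in a single step with `summarise()` and `across()`. The sketch below assumes dplyr 1.0 or later and uses a hypothetical tibble, `toy`, in place of `dataframe`.

```r
library(dplyr)

# Hypothetical stand-in for `dataframe` with made-up areas.
toy <- tibble(
  Area = c("Area A", "Area A", "Area B", "Area B"),
  Item = c("Chickens", "Ducks", "Chickens", "Chickens"),
  Year = c(1961, 1961, 1961, 1962)
)

# Apply n_distinct() to each selected column; the result is a one-row tibble.
distinct_counts <- toy %>%
  summarise(across(c(Year, Item, Area), n_distinct))
distinct_counts
```

This keeps the counts side by side in one table, which is easier to extend if more columns need checking.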